Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors

Authors:
Haitham Akkary; Ravi Rajwar; Srikanth T. Srinivasan
Affiliations:
Microprocessor Research Labs, Intel Corporation, Hillsboro, OR; Microprocessor Research Labs, Intel Corporation, Hillsboro, OR; Microprocessor Research Labs, Intel Corporation, Hillsboro, OR
Venue:
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2003

Abstract

Large instruction window processors achieve high performance by exposing large amounts of instruction-level parallelism. However, accessing large hardware structures typically required to buffer and process such instruction window sizes significantly degrades the cycle time. This paper proposes a novel Checkpoint Processing and Recovery (CPR) microarchitecture, and shows how to implement a large instruction window processor without requiring large structures, thus permitting a high clock frequency. We focus on four critical aspects of a microarchitecture: 1) scheduling instructions, 2) recovering from branch mispredicts, 3) buffering a large number of stores and forwarding data from stores to any dependent load, and 4) reclaiming physical registers. While scheduling window size is important, we show the performance of large instruction windows to be more sensitive to the other three design issues. Our CPR proposal incorporates novel microarchitectural schemes for addressing these design issues: a selective checkpoint mechanism for recovering from mispredicts, a hierarchical store queue organization for fast store-load forwarding, and an effective algorithm for aggressive physical register reclamation. Our proposals allow a processor to realize performance gains due to instruction windows of thousands of instructions without requiring large cycle-critical hardware structures.