Checkpoint repair for out-of-order execution machines
ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Assigning confidence to conditional branch predictions
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Register renaming and dynamic speculation: an alternative approach
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Improving data cache performance by pre-executing instructions under a cache miss
ICS '97 Proceedings of the 11th international conference on Supercomputing
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
Streamlining inter-operation memory communication via data dependence prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Implementation of precise interrupts in pipelined processors
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Multiple-banked register file architectures
Proceedings of the 27th annual international symposium on Computer architecture
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
Increasing processor performance by implementing deeper pipelines
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A large, fast instruction window for tolerating cache misses
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Select-free instruction scheduling logic
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Reducing the complexity of the register file in dynamic superscalar processors
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
Cherry: checkpointed early resource recycling in out-of-order microprocessors
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Checkpointing alternatives for high performance, power-aware processors
Proceedings of the 2003 international symposium on Low power electronics and design
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Out-of-Order Commit Processors
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
POWER4 system microarchitecture
IBM Journal of Research and Development
Fast branch misprediction recovery in out-of-order superscalar processors
Proceedings of the 19th annual international conference on Supercomputing
BranchTap: improving performance with very few checkpoints through adaptive speculation control
Proceedings of the 20th annual international conference on Supercomputing
Transient fault prediction based on anomalies in processor events
Proceedings of the conference on Design, automation and test in Europe
On the latency, energy and area of checkpointed, superscalar register alias tables
ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Hiding the misprediction penalty of a resource-efficient high-performance processor
ACM Transactions on Architecture and Code Optimization (TACO)
A physical level study and optimization of CAM-based checkpointed register alias table
Proceedings of the 13th international symposium on Low power electronics and design
Reexecution and Selective Reuse in Checkpoint Processors
Transactions on High-Performance Embedded Architectures and Compilers II
Checkpoint allocation and release
ACM Transactions on Architecture and Code Optimization (TACO)
Turbo-ROB: a low cost checkpoint/restore accelerator
HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
A power-aware hybrid RAM-CAM renaming mechanism for fast recovery
ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
On the latency and energy of checkpointed superscalar register alias tables
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Achieving reliable system performance by fast recovery of branch miss prediction
Journal of Network and Computer Applications
ARCS'13 Proceedings of the 26th international conference on Architecture of Computing Systems
Tuning the continual flow pipeline architecture with virtual register renaming
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Large instruction window processors achieve high performance by exposing large amounts of instruction level parallelism. However, accessing large hardware structures typically required to buffer and process such instruction window sizes significantly degrade the cycle time. This paper proposes a novel checkpoint processing and recovery (CPR) microarchitecture, and shows how to implement a large instruction window processor without requiring large structures thus permitting a high clock frequency.We focus on four critical aspects of a microarchitecture: (1) scheduling instructions, (2) recovering from branch mispredicts, (3) buffering a large number of stores and forwarding data from stores to any dependent load, and (4) reclaiming physical registers. While scheduling window size is important, we show the performance of large instruction windows to be more sensitive to the other three design issues. Our CPR proposal incorporates novel microarchitectural schemes for addressing these design issues---a selective checkpoint mechanism for recovering from mispredicts, a hierarchical store queue organization for fast store-load forwarding, and an effective algorithm for aggressive physical register reclamation. Our proposals allow a processor to realize performance gains due to instruction windows of thousands of instructions without requiring large cycle-critical hardware structures.