Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors
Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing
Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture
On the importance of optimizing the configuration of stream prefetchers
Proceedings of the 2005 Workshop on Memory System Performance
CAVA: Using checkpoint-assisted value prediction to hide L2 misses
ACM Transactions on Architecture and Code Optimization (TACO)
Overlapping dependent loads with addressless preload
Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques
Long-latency branches: how much do they matter?
ACM SIGARCH Computer Architecture News
TMA: a trap-based memory architecture
Proceedings of the 20th Annual International Conference on Supercomputing
Future execution: A prefetching mechanism that uses multiple cores to speed up single threads
ACM Transactions on Architecture and Code Optimization (TACO)
Scalable Cache Miss Handling for High Memory-Level Parallelism
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive VP decay: making value predictors leakage-efficient designs for high performance processors
Proceedings of the 4th International Conference on Computing Frontiers
BulkSC: bulk enforcement of sequential consistency
Proceedings of the 34th Annual International Symposium on Computer Architecture
An L2-miss-driven early register deallocation for SMT processors
Proceedings of the 21st Annual International Conference on Supercomputing
Focused prefetching: performance oriented prefetching based on commit stalls
Proceedings of the 22nd Annual International Conference on Supercomputing
Reducing register pressure in SMT processors through L2-miss-driven early register release
ACM Transactions on Architecture and Code Optimization (TACO)
A performance-correctness explicitly-decoupled architecture
Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture
Combining thread level speculation helper threads and runahead execution
Proceedings of the 23rd International Conference on Supercomputing
Checkpoint allocation and release
ACM Transactions on Architecture and Code Optimization (TACO)
Software-assisted hardware reliability: abstracting circuit-level challenges to the software stack
Proceedings of the 46th Annual Design Automation Conference
Eliminating voltage emergencies via software-guided code transformations
ACM Transactions on Architecture and Code Optimization (TACO)
An event-guided approach to reducing voltage noise in processors
Proceedings of the Conference on Design, Automation and Test in Europe
Leakage-efficient design of value predictors through state and non-state preserving techniques
The Journal of Supercomputing
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Leveraging Strength-Based Dynamic Information Flow Analysis to Enhance Data Value Prediction
ACM Transactions on Architecture and Code Optimization (TACO)
Achieving reliable system performance by fast recovery of branch miss prediction
Journal of Network and Computer Applications
Mixed speculative multithreaded execution models
ACM Transactions on Architecture and Code Optimization (TACO)
Improving memory scheduling via processor-side load criticality information
Proceedings of the 40th Annual International Symposium on Computer Architecture
Revisiting reorder buffer architecture for next generation high performance computing
The Journal of Supercomputing
Long-latency loads are critical in today's processors due to the ever-increasing speed gap with memory. Not only do these loads block the execution of dependent instructions, they also prevent other instructions from moving through the in-order reorder buffer (ROB) and retiring. As a result, the processor quickly fills up with uncommitted instructions, and computation ultimately stalls. To attack this problem, we propose checkpointed early load retirement, a mechanism that combines register checkpointing and back-end (i.e., at-retirement) load-value prediction. When a long-latency load hits the ROB head unresolved, the processor enters Clear mode by (1) taking a Checkpoint of the architectural registers, (2) supplying a Load-value prediction to consumers, and (3) EARly-retiring the long-latency load. This unclogs the ROB, thereby "clearing the way" for subsequent instructions to retire, and also allows instructions dependent on the long-latency load to execute sooner. When the actual value returns from memory, it is compared against the prediction. A misprediction causes the processor to roll back to the checkpoint, discarding all subsequent computation. The benefits of executing in Clear mode come from providing early forward progress on correct predictions, and from warming up caches and other structures on wrong predictions. Our evaluation shows that a Clear implementation with support for four checkpoints yields an average speedup of 1.12 for both eleven integer and eight floating-point applications (1.27 and 1.19 for five integer and five floating-point memory-bound applications, respectively), relative to a contemporary out-of-order processor with an aggressive hardware prefetcher.
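The Clear mechanism described in the abstract can be sketched in software as a toy model. This is only an illustrative simplification under stated assumptions: the real mechanism is implemented in hardware, the class and method names below are invented for this sketch, and it models a single outstanding checkpoint per load rather than the full four-checkpoint design.

```python
# Toy model of checkpointed early load retirement ("Clear" mode).
# Hypothetical names; hardware details (ROB, rename, prefetch warm-up)
# are abstracted away into an architectural-register dictionary.

class ClearCore:
    def __init__(self, max_checkpoints=4):
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []   # saved architectural register files
        self.regs = {}          # current architectural registers
        self.rollbacks = 0      # count of misprediction recoveries

    def long_latency_load(self, dest, predicted, actual):
        """A load is unresolved at the ROB head: predict and early-retire it.

        Returns True if the prediction validated, False on rollback or
        when no checkpoint was available (the processor stalls as usual).
        """
        if len(self.checkpoints) >= self.max_checkpoints:
            return False                             # no free checkpoint
        self.checkpoints.append(dict(self.regs))     # (1) take Checkpoint
        self.regs[dest] = predicted                  # (2) supply Load-value
        # (3) the load is EARly-retired; the ROB is unclogged and
        # dependent instructions execute speculatively in Clear mode.
        # Later, the value returns from memory and is validated:
        if predicted != actual:
            self.regs = self.checkpoints.pop()       # roll back, discard work
            self.regs[dest] = actual                 # re-execute with real value
            self.rollbacks += 1
            return False
        self.checkpoints.pop()                       # correct: free checkpoint
        return True
```

A usage example: a correct prediction lets speculative results commit, while a wrong one restores the checkpointed register state (in the real design, work done in Clear mode still helps by warming up caches).

```python
core = ClearCore()
core.long_latency_load("r1", predicted=7, actual=7)   # validates, commits
core.long_latency_load("r2", predicted=0, actual=5)   # rolls back, refetches
```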