Checkpointed Early Load Retirement

  • Authors:
  • Nevin Kirman;Meyrem Kirman;Mainak Chaudhuri;Jose F. Martinez

  • Affiliations:
  • Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY

  • Venue:
  • HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

Long-latency loads are critical in today's processors due to the ever-increasing speed gap with memory. Not only do these loads block the execution of dependent instructions, they also prevent other instructions from moving through the in-order reorder buffer (ROB) and retire. As a result, the processor quickly fills up with uncommitted instructions, and computationultimately stalls. To attack this problem, we propose checkpointed early load retirement, a mechanism that combines register check-pointing and back-end-i.e., at retirement-load-value prediction. When a long-latency load hits the ROB head unresolved, the processor enters Clear mode by (1) taking a Checkpoint of the architectural registers, (2) supplying a Load-value prediction to consumers, and (3) EARly-retiring the long-latency load. This unclogs the ROB, thereby "clearing the way" for subsequent instructions to retire, and also allowing instructions dependent on the long-latency load to execute sooner. When the actual value returns from memory, it is compared against the prediction. A mis-prediction causes the processor to roll back to the check-point, discarding all subsequent computation. The benefits of executing in Clear mode come from providing early forward progress on correct predictions, and from warming up caches and other structures on wrong predictions. Our evaluation shows that a Clear implementation with support for four checkpoints yields an average speedup of 1.12 for both eleven integer and eight floating-point applications (1.27 and 1.19 for five integer and five floating-point memory-bound applications, respectively), relative to a contemporary out-of-order processor with an aggressive hardware prefetcher.