Scalable Store-Load Forwarding via Store Queue Index Prediction

Authors:
Tingting Sha;Milo M. K. Martin;Amir Roth
Affiliations:
University of Pennsylvania;University of Pennsylvania;University of Pennsylvania
Venue:
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2005

Citing 17
Cited 18

ARB: A Hardware Mechanism for Dynamic Reordering of Memory References

IEEE Transactions on Computers
Dynamic speculation and synchronization of data dependences

Proceedings of the 24th annual international symposium on Computer architecture
Improving the accuracy and performance of memory communication through renaming

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Streamlining inter-operation memory communication via data dependence prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Memory dependence prediction using store sets

Proceedings of the 25th annual international symposium on Computer architecture
Speculation techniques for improving load related instruction scheduling

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Dynamic memory disambiguation in the presence of out-of-order store issuing

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
The Alpha 21264 Microprocessor

IEEE Micro
Three extensions to register integration

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Scalable Hardware Memory Disambiguation for High ILP Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Design Complexity of the Load/Store Queue

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Memory Ordering: A Value-Based Approach

Proceedings of the 31st annual international symposium on Computer architecture
Continual flow pipelines

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Scalable Load and Store Processing in Latency Tolerant Processors

Proceedings of the 32nd annual international symposium on Computer Architecture
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization

Proceedings of the 32nd annual international symposium on Computer Architecture
Store Buffer Design in First-Level Multibanked Data Caches

Proceedings of the 32nd annual international symposium on Computer Architecture
Understanding Scheduling Replay Schemes

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture

Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification

Proceedings of the 33rd annual international symposium on Computer Architecture
Reducing cache traffic and energy with macro data load

Proceedings of the 2006 international symposium on Low power electronics and design
Substituting associative load queue with simple hash tables in out-of-order microprocessors

Proceedings of the 2006 international symposium on Low power electronics and design
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
NoSQ: Store-Load Communication without a Store Queue

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
DMDC: Delayed Memory Dependence Checking through Age-Based Filtering

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Late-binding: enabling unordered load-store queues

Proceedings of the 34th annual international symposium on Computer architecture
Ginger: control independence using tag rewriting

Proceedings of the 34th annual international symposium on Computer architecture
NoSQ: Store-Load Communication without a Store Queue

IEEE Micro
A modular 3d processor for flexible product design and technology migration

Proceedings of the 5th conference on Computing frontiers
Counting Dependence Predictors

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Using age registers for a simple load-store queue filtering

Journal of Systems Architecture: the EUROMICRO Journal
Zero loads: canceling load requests by tracking zero values

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors

Proceedings of the 36th annual international symposium on Computer architecture
Design and optimization of the store vectors memory dependence predictor

ACM Transactions on Architecture and Code Optimization (TACO)
On reducing load/store latencies of cache accesses

Journal of Systems Architecture: the EUROMICRO Journal
Federation: Boosting per-thread performance of throughput-oriented manycore architectures

ACM Transactions on Architecture and Code Optimization (TACO)
L1 data cache power reduction using a forwarding predictor

PATMOS'10 Proceedings of the 20th international conference on Integrated circuit and system design: power and timing modeling, optimization and simulation

Quantified Score

Hi-index	0.00

Visualization

Abstract

Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths required by wide-issue, large window processors. In this work, we improve SQ scalability by implementing store-load forwarding using speculative indexed access rather than associative search. Our design uses prediction to identify the single SQ entry from which each dynamic load is most likely to forward. When a load executes, it either obtains its value from the predicted SQ entry (if the address of the entry matches the load address) or the data cache (otherwise). A forwarding mis-prediction驴detected by pre-commit filtered load re-execution-results in a pipeline flush. SQ index prediction is generally accurate, but for some loads it cannot reliably identify a single SQ entry. To avoid flushes on these difficult loads while keeping the single-SQ-access-per-load invariant, a second predictor delays difficult loads until all but the youngest of their "candidate" stores have committed. Our predictors are inspired by store-load dependence predictors for load scheduling (Store Sets and the Exclusive Collision Predictor) and unify load scheduling and forwarding. Experiments on the SPEC2000 and MediaBench benchmarks show that on an 8-way issue processor with a 512-entry reorder buffer, our technique performs within 3.3% of an ideal associative SQ (same latency as the data cache) and either matches or exceeds the performance of a realistic associative SQ (slower than data cache) on 31 of 47 programs.