Fire-and-Forget: Load/Store Scheduling with No Store Queue at All

Authors:
Samantika Subramaniam;Gabriel H. Loh
Affiliations:
Georgia Institute of Technology;Georgia Institute of Technology
Venue:
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2006

Citing 19
Cited 12

Efficient detection of all pointer and array access errors

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
ARB: A Hardware Mechanism for Dynamic Reordering of Memory References

IEEE Transactions on Computers
Streamlining inter-operation memory communication via data dependence prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MediaBench: a tool for evaluating and synthesizing multimedia and communications systems

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Memory dependence prediction using store sets

Proceedings of the 25th annual international symposium on Computer architecture
Speculative Memory Cloaking and Bypassing

International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Silent stores for free

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Frequent value locality and value-centric data cache design

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
SimpleScalar: An Infrastructure for Computer System Modeling

Computer
Picking Statistically Valid and Early Simulation Points

Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Scalable Hardware Memory Disambiguation for High ILP Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Memory Ordering: A Value-Based Approach

Proceedings of the 31st annual international symposium on Computer architecture
Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization

Proceedings of the 32nd annual international symposium on Computer Architecture
Store Buffer Design in First-Level Multibanked Data Caches

Proceedings of the 32nd annual international symposium on Computer Architecture
Scalable Store-Load Forwarding via Store Queue Index Prediction

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Address-Indexed Memory Disambiguation and Store-to-Load Forwarding

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
MiBench: A free, commercially representative embedded benchmark suite

WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
NoSQ: Store-Load Communication without a Store Queue

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture

NoSQ: Store-Load Communication without a Store Queue

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Mechanisms for store-wait-free multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
Late-binding: enabling unordered load-store queues

Proceedings of the 34th annual international symposium on Computer architecture
NoSQ: Store-Load Communication without a Store Queue

IEEE Micro
A modular 3d processor for flexible product design and technology migration

Proceedings of the 5th conference on Computing frontiers
Counting Dependence Predictors

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
A Two-Level Load/Store Queue Based on Execution Locality

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Zero loads: canceling load requests by tracking zero values

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors

Proceedings of the 36th annual international symposium on Computer architecture
An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors

Journal of Signal Processing Systems
Federation: Boosting per-thread performance of throughput-oriented manycore architectures

ACM Transactions on Architecture and Code Optimization (TACO)
A unified approach to eliminate memory accesses early

CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems


Abstract

Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows for load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our "Fire-and-Forget" (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load will use a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly outperform the competition. Specifically, our simulation results show that our SQ-less Fire-and-Forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ.
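
The forwarding idea in the abstract can be illustrated with a small software model. The following is only a toy sketch under stated assumptions, not the authors' microarchitecture: the class name `ToyFireAndForget`, the per-store-PC `distance` table, the `last_writer` training map, and the two-phase "execute, then commit" window are all hypothetical simplifications introduced here. It shows a store writing its value into a predicted load-queue slot with no associative search, and a value-based check at commit that catches wrong or missing forwards and retrains the predictor.

```python
class ToyFireAndForget:
    """Toy model of store-to-load forwarding by predicted LQ index (no store queue).

    All structure names here are illustrative assumptions, not the paper's design.
    """

    def __init__(self):
        self.memory = {}         # committed memory state: addr -> value
        self.distance = {}       # hypothetical predictor: store PC -> LQ distance
        self.mispredictions = 0  # forwards found wrong or missing at commit

    def run_window(self, trace):
        """Run one instruction window.

        trace holds ("st", pc, addr, value) or ("ld", pc, addr) in program order.
        Returns the value each load finally commits with, in load-queue order.
        """
        # Execute phase: no store in the window has written memory yet.
        forwarded = {}  # LQ index -> (value, forwarding store PC)
        loads = []      # load queue entries: [addr, value, forwarding PC or None]
        order = []      # program-order record used by the commit phase
        for op in trace:
            if op[0] == "st":
                _, pc, addr, value = op
                tail = len(loads)  # LQ tail when the store is dispatched
                if pc in self.distance:
                    # "Fire": write straight into the predicted LQ slot.
                    forwarded[tail + self.distance[pc]] = (value, pc)
                order.append(("st", pc, addr, value, tail))
            else:
                _, pc, addr = op
                idx = len(loads)
                if idx in forwarded:
                    value, src = forwarded[idx]
                else:
                    value, src = self.memory.get(addr, 0), None
                loads.append([addr, value, src])
                order.append(("ld", idx))

        # Commit phase: stores update memory in order; loads re-execute.
        last_writer = {}  # addr -> (store PC, LQ tail at that store's dispatch)
        for rec in order:
            if rec[0] == "st":
                _, pc, addr, value, tail = rec
                self.memory[addr] = value
                last_writer[addr] = (pc, tail)
            else:
                idx = rec[1]
                addr, value, src = loads[idx]
                correct = self.memory.get(addr, 0)
                if value != correct:
                    # Value mismatch: real hardware would squash and replay.
                    self.mispredictions += 1
                    loads[idx][1] = correct
                    if src is not None:
                        self.distance.pop(src, None)  # untrain a bad forward
                    if addr in last_writer:
                        wpc, tail = last_writer[addr]
                        self.distance[wpc] = idx - tail  # learn the distance
        return [entry[1] for entry in loads]
```

As a usage example, running the window `[("st", 0x10, 100, 7), ("ld", 0x20, 200), ("ld", 0x24, 100)]` once leaves the second load with a stale value until the commit-time check corrects it and records a distance of 1 for the store at PC `0x10`. Running the same window again with a new store value forwards directly into LQ slot 1 and commits without a misprediction.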