Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly to the sizes required by large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. SQIP works because a load that receives data directly from a store usually receives it from the same store each time. In our work, we take a slightly different view of the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our "Fire-and-Forget" (FnF) scheme for load/store scheduling and forwarding, which eliminates the store queue completely. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load uses a predicted LQ index to write the value directly to the LQ entry without any associative logic. Any mispredictions or misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly outperform the competition. Specifically, our simulation results show that our SQ-less Fire-and-Forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ.
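The forwarding idea can be illustrated with a toy software model. The sketch below is only an illustration under stated assumptions, not the paper's design: the function `run_window`, the `LoadEntry` type, the trace format, and the distance-based predictor (store PC mapped to "how many loads later the consumer sits in the LQ") are all hypothetical. Speculative cloaking and real pipeline timing are left out. Each store writes its value into a predicted LQ slot with no associative search, and every load is re-checked against committed memory before it commits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadEntry:
    addr: int
    forwarded: Optional[int] = None  # value pushed in by a store, if any


def run_window(trace, memory, predictor):
    """Toy model of store-driven forwarding with pre-commit re-execution.

    trace:     program-order list of ("st", pc, addr, value) or ("ld", pc, addr)
    memory:    dict addr -> value (committed state, updated in place)
    predictor: dict store_pc -> predicted LQ distance (hypothetical format)
    Returns (committed load values, number of detected misforwardings).
    """
    lq, stores = [], []
    # Dispatch: allocate LQ entries; each store records the current LQ tail.
    for op in trace:
        if op[0] == "ld":
            lq.append(LoadEntry(addr=op[2]))
        else:
            stores.append((op[1], op[3], len(lq)))

    # Execute: a store "fires" its value at one predicted LQ index, no search.
    for pc, value, tail in stores:
        dist = predictor.get(pc)
        if dist is not None and tail + dist < len(lq):
            lq[tail + dist].forwarded = value

    # Loads obtain a speculative value: forwarded data, else stale memory.
    spec = [e.forwarded if e.forwarded is not None else memory.get(e.addr, 0)
            for e in lq]

    # Commit in program order, re-executing each load against memory.
    last_writer = {}  # addr -> (store pc, LQ tail when that store dispatched)
    results, misforwards, i = [], 0, 0
    for op in trace:
        if op[0] == "st":
            _, pc, addr, value = op
            memory[addr] = value
            last_writer[addr] = (pc, i)
        else:
            addr = op[2]
            correct = memory.get(addr, 0)
            if spec[i] != correct:
                misforwards += 1  # a real core would flush and replay here
            if addr in last_writer:
                store_pc, tail = last_writer[addr]
                predictor[store_pc] = i - tail  # train on the true distance
            results.append(correct)
            i += 1
    return results, misforwards
```

Running the same store/load pair twice shows the intended behavior: the first pass has no prediction, so the load reads stale memory and the re-execution check catches it. The second pass forwards through the trained LQ index and commits cleanly.

```python
pred = {}
trace = [("st", 0x10, 100, 7), ("ld", 0x20, 100)]
print(run_window(trace, {100: 1}, pred))  # first pass: one misforwarding
print(run_window(trace, {100: 1}, pred))  # second pass: none
```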