High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Evaluation of A+B=K Conditions Without Carry Propagation
IEEE Transactions on Computers
Designing the TFP Microprocessor
IEEE Micro
Trading conflict and capacity aliasing in conditional branch predictors
Proceedings of the 24th annual international symposium on Computer architecture
Memory dependence prediction using store sets
Proceedings of the 25th annual international symposium on Computer architecture
Speculation techniques for improving load related instruction scheduling
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Inherently Lower-Power High-Performance Superscalar Architectures
IEEE Transactions on Computers
A large, fast instruction window for tolerating cache misses
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Design tradeoffs for the Alpha EV8 conditional branch predictor
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Automatically characterizing large scale program behavior
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
IEEE Micro
Partitioned first-level cache design for clustered microarchitectures
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Dynamically managing the communication-parallelism trade-off in future clustered processors
Proceedings of the 30th annual international symposium on Computer architecture
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Scalable Hardware Memory Disambiguation for High ILP Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Design Complexity of the Load/Store Queue
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Load-Store Queue Management: an Energy-Efficient Design Based on a State-Filtering Mechanism.
ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Scalable Store-Load Forwarding via Store Queue Index Prediction
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Address-Indexed Memory Disambiguation and Store-to-Load Forwarding
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 33rd annual international symposium on Computer Architecture
Decomposing the load-store queue by function for power reduction and scalability
IBM Journal of Research and Development
Substituting associative load queue with simple hash tables in out-of-order microprocessors
Proceedings of the 2006 international symposium on Low power electronics and design
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
NoSQ: Store-Load Communication without a Store Queue
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
DMDC: Delayed Memory Dependence Checking through Age-Based Filtering
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Late-binding: enabling unordered load-store queues
Proceedings of the 34th annual international symposium on Computer architecture
A performance-correctness explicitly-decoupled architecture
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
This paper focuses on how to design a Store Buffer (STB) well suited to first-level multibanked data caches. Our goal is to forward data from in-flight stores to dependent loads with the latency of a cache bank. For that we propose a particular two-level STB design in which forwarding is done speculatively from a distributed first-level STB made of extremely small banks, while a centralized, second-level STB enforces correct store-load ordering a few cycles later. To that end we have identified several important design decisions: i) delaying allocation of first-level STB entries until stores execute, ii) deallocating first-level STB entries before stores commit, and iii) selecting a recovery policy well-matched to data forwarding misspeculations. Moreover, the two-level STB admits two enhancements that simplify the design leaving performance almost unchanged: i) removing the data forwarding capability from the second-level STB, and ii) not checking instruction age in first-level STB prior to forwarding data to loads. Following our guidelines and running SPECint-2K over an 8-way out-of-order processor, a two-level STB (first level with four STB banks of 8 entries each) performs similarly to an ideal, single-level STB with 128-entry banks working at the first-level cache latency.