ARB: A Hardware Mechanism for Dynamic Reordering of Memory References
IEEE Transactions on Computers
Dynamic speculation and synchronization of data dependences
Proceedings of the 24th annual international symposium on Computer architecture
Improving the accuracy and performance of memory communication through renaming
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Streamlining inter-operation memory communication via data dependence prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Memory dependence prediction using store sets
Proceedings of the 25th annual international symposium on Computer architecture
Speculation techniques for improving load related instruction scheduling
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Dynamic memory disambiguation in the presence of out-of-order store issuing
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
The Alpha 21264 Microprocessor
IEEE Micro
Three extensions to register integration
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Scalable Hardware Memory Disambiguation for High ILP Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Design Complexity of the Load/Store Queue
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Memory Ordering: A Value-Based Approach
Proceedings of the 31st annual international symposium on Computer architecture
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Scalable Load and Store Processing in Latency Tolerant Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization
Proceedings of the 32nd annual international symposium on Computer Architecture
Store Buffer Design in First-Level Multibanked Data Caches
Proceedings of the 32nd annual international symposium on Computer Architecture
Understanding Scheduling Replay Schemes
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Proceedings of the 33rd annual international symposium on Computer Architecture
Reducing cache traffic and energy with macro data load
Proceedings of the 2006 international symposium on Low power electronics and design
Substituting associative load queue with simple hash tables in out-of-order microprocessors
Proceedings of the 2006 international symposium on Low power electronics and design
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
NoSQ: Store-Load Communication without a Store Queue
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
DMDC: Delayed Memory Dependence Checking through Age-Based Filtering
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Late-binding: enabling unordered load-store queues
Proceedings of the 34th annual international symposium on Computer architecture
Ginger: control independence using tag rewriting
Proceedings of the 34th annual international symposium on Computer architecture
A modular 3d processor for flexible product design and technology migration
Proceedings of the 5th conference on Computing frontiers
Counting Dependence Predictors
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Using age registers for a simple load-store queue filtering
Journal of Systems Architecture: the EUROMICRO Journal
Zero loads: canceling load requests by tracking zero values
Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Proceedings of the 36th annual international symposium on Computer architecture
Design and optimization of the store vectors memory dependence predictor
ACM Transactions on Architecture and Code Optimization (TACO)
On reducing load/store latencies of cache accesses
Journal of Systems Architecture: the EUROMICRO Journal
Federation: Boosting per-thread performance of throughput-oriented manycore architectures
ACM Transactions on Architecture and Code Optimization (TACO)
L1 data cache power reduction using a forwarding predictor
PATMOS'10 Proceedings of the 20th international conference on Integrated circuit and system design: power and timing modeling, optimization and simulation
Hi-index | 0.00 |
Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths required by wide-issue, large window processors. In this work, we improve SQ scalability by implementing store-load forwarding using speculative indexed access rather than associative search. Our design uses prediction to identify the single SQ entry from which each dynamic load is most likely to forward. When a load executes, it either obtains its value from the predicted SQ entry (if the address of the entry matches the load address) or the data cache (otherwise). A forwarding mis-prediction驴detected by pre-commit filtered load re-execution-results in a pipeline flush. SQ index prediction is generally accurate, but for some loads it cannot reliably identify a single SQ entry. To avoid flushes on these difficult loads while keeping the single-SQ-access-per-load invariant, a second predictor delays difficult loads until all but the youngest of their "candidate" stores have committed. Our predictors are inspired by store-load dependence predictors for load scheduling (Store Sets and the Exclusive Collision Predictor) and unify load scheduling and forwarding. Experiments on the SPEC2000 and MediaBench benchmarks show that on an 8-way issue processor with a 512-entry reorder buffer, our technique performs within 3.3% of an ideal associative SQ (same latency as the data cache) and either matches or exceeds the performance of a realistic associative SQ (slower than data cache) on 31 of 47 programs.