Conventional out-of-order processors employ a multi-ported, fully-associative load queue to guarantee correct memory reference order both within a single thread of execution and across threads in a multiprocessor system. As improvements in process technology and pipelining lead to higher clock frequencies, scaling this complex structure to accommodate a larger number of in-flight loads becomes difficult if not impossible. Furthermore, each access to this complex structure consumes excessive amounts of energy. In this paper, we solve the associative load queue scalability problem by completely eliminating the associative load queue. Instead, data dependences and memory consistency constraints are enforced by simply re-executing load instructions in program order prior to retirement. Using heuristics to filter the set of loads that must be re-executed, we show that our replay-based mechanism enables a simple, scalable, and energy-efficient FIFO load queue design with no associative lookup functionality, while sacrificing only a negligible amount of performance and cache bandwidth.
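
The replay idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class and field names (`FifoLoadQueue`, `LoadEntry`, `needs_replay`) are hypothetical, a plain dictionary stands in for the cache, and the single filter shown (replay only loads that issued while an older store was still unresolved) is an assumed example of a filtering heuristic, not one taken from the text.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LoadEntry:
    addr: int
    value: int          # value observed when the load first executed out of order
    needs_replay: bool  # hypothetical filter bit: was the load exposed to reordering?


class FifoLoadQueue:
    """Toy FIFO load queue: no associative search, only head/tail access."""

    def __init__(self, memory):
        self.memory = memory   # dict addr -> value, standing in for the cache
        self.queue = deque()   # loads kept in program order
        self.replays = 0       # number of loads re-executed at retirement

    def issue_load(self, addr, older_store_unresolved):
        # Premature execution: read whatever the memory holds right now.
        entry = LoadEntry(addr, self.memory.get(addr, 0), older_store_unresolved)
        self.queue.append(entry)
        return entry

    def retire_oldest(self):
        """Return True if the oldest load may retire, False if it must be squashed."""
        entry = self.queue.popleft()
        if not entry.needs_replay:
            # Assumed filter: this load could not have been reordered, skip replay.
            return True
        # Replay in program order and compare against the premature value.
        self.replays += 1
        return self.memory.get(entry.addr, 0) == entry.value
```

In this sketch a load whose replayed value differs from its original value signals an ordering violation, which a real pipeline would handle by squashing and refetching. Loads cleared by the filter retire without touching the cache a second time, which is where the bandwidth saving would come from under these assumptions.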