Memory Ordering: A Value-Based Approach

Authors:
Harold W. Cain;Mikko H. Lipasti
Affiliations:
Univ. of Wisconsin-Madison;Univ. of Wisconsin-Madison
Venue:
Proceedings of the 31st annual international symposium on Computer architecture
Year:
2004

Citing 17
Cited 31

Race-free interconnection networks and multiprocessor consistency

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Memory dependence prediction using store sets

Proceedings of the 25th annual international symposium on Computer architecture
Dynamic memory disambiguation in the presence of out-of-order store issuing

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
DIVA: a reliable substrate for deep submicron microarchitecture design

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
On the value locality of store instructions

Proceedings of the 27th annual international symposium on Computer architecture
Automatable verification of sequential consistency

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Complete Computer System Simulation: The SimOS Approach

IEEE Parallel & Distributed Technology: Systems & Technology
The MIPS R10000 Superscalar Microprocessor

IEEE Micro
Cherry: checkpointed early resource recycling in out-of-order microprocessors

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Variability in Architectural Simulations of Multi-Threaded Workloads

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Scalable Hardware Memory Disambiguation for High ILP Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Design Complexity of the Load/Store Queue

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research

IEEE Computer Architecture Letters

Scalable Hardware Memory Disambiguation for High-ILP Processors

IEEE Micro
Toward kilo-instruction processors

ACM Transactions on Architecture and Code Optimization (TACO)
Scalable Load and Store Processing in Latency Tolerant Processors

Proceedings of the 32nd annual international symposium on Computer Architecture
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization

Proceedings of the 32nd annual international symposium on Computer Architecture
Load-Store Queue Management: an Energy-Efficient Design Based on a State-Filtering Mechanism.

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Scalable Store-Load Forwarding via Store Queue Index Prediction

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Address-Indexed Memory Disambiguation and Store-to-Load Forwarding

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Scalable Load and Store Processing in Latency-Tolerant Processors

IEEE Micro
Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification

Proceedings of the 33rd annual international symposium on Computer Architecture
Decomposing the load-store queue by function for power reduction and scalability

IBM Journal of Research and Development
Substituting associative load queue with simple hash tables in out-of-order microprocessors

Proceedings of the 2006 international symposium on Low power electronics and design
A regulated transitive reduction (RTR) for longer memory race recording

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
TMA: a trap-based memory architecture

Proceedings of the 20th annual international conference on Supercomputing
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
NoSQ: Store-Load Communication without a Store Queue

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
DMDC: Delayed Memory Dependence Checking through Age-Based Filtering

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
BulkSC: bulk enforcement of sequential consistency

Proceedings of the 34th annual international symposium on Computer architecture
Late-binding: enabling unordered load-store queues

Proceedings of the 34th annual international symposium on Computer architecture
Ginger: control independence using tag rewriting

Proceedings of the 34th annual international symposium on Computer architecture
NoSQ: Store-Load Communication without a Store Queue

IEEE Micro
Power-efficient and scalable load/store queue design via address compression

Proceedings of the 2008 ACM symposium on Applied computing
A modular 3d processor for flexible product design and technology migration

Proceedings of the 5th conference on Computing frontiers
A Two-Level Load/Store Queue Based on Execution Locality

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Dynamic parallelization of single-threaded binary programs using speculative slicing

Proceedings of the 23rd international conference on Supercomputing
InvisiFence: performance-transparent memory ordering in conventional multiprocessors

Proceedings of the 36th annual international symposium on Computer architecture
Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors

Proceedings of the 36th annual international symposium on Computer architecture
Design and optimization of the store vectors memory dependence predictor

ACM Transactions on Architecture and Code Optimization (TACO)
BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
SAMIE-LSQ: set-associative multiple-instruction entry load/store queue

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A power-efficient and scalable load-store queue design

PATMOS'05 Proceedings of the 15th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
Exploring memory consistency for massively-threaded throughput-oriented processors

Proceedings of the 40th Annual International Symposium on Computer Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Conventional out-of-order processors employ a multi-ported,fully-associative load queue to guarantee correctmemory reference order both within a single thread of executionand across threads in a multiprocessor system. Asimprovements in process technology and pipelining lead tohigher clock frequencies, scaling this complex structure toaccommodate a larger number of in-flight loads becomesdifficult if not impossible. Furthermore, each access to thiscomplex structure consumes excessive amounts of energy.In this paper, we solve the associative load queue scalabilityproblem by completely eliminating the associative loadqueue. Instead, data dependences and memory consistencyconstraints are enforced by simply re-executing loadinstructions in program order prior to retirement. Usingheuristics to filter the set of loads that must be re-executed,we show that our replay-based mechanism enables a simple,scalable, and energy-efficient FIFO load queue designwith no associative lookup functionality, while sacrificingonly a negligible amount of performance and cache bandwidth.