Address-Indexed Memory Disambiguation and Store-to-Load Forwarding

Authors:
Sam S. Stone;Kevin M. Woley;Matthew I. Frank
Affiliations:
University of Illinois, Urbana-Champaign;University of Illinois, Urbana-Champaign;University of Illinois, Urbana-Champaign
Venue:
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2005

Abstract

This paper describes a scalable, low-complexity alternative to the conventional load/store queue (LSQ) for superscalar processors that execute load and store instructions speculatively and out-of-order prior to resolving their dependences. Whereas the LSQ requires associative and age-prioritized searches for each access, we propose that an address-indexed store-forwarding cache (SFC) perform store-to-load forwarding and that an address-indexed memory disambiguation table (MDT) perform memory disambiguation. Neither structure includes a CAM. The SFC behaves as a small cache, accessed speculatively and out-of-order by both loads and stores. Because the SFC does not rename in-flight stores to the same address, violations of memory anti and output dependences can cause in-flight loads to obtain incorrect values from the SFC. Therefore, the MDT uses sequence numbers to detect and recover from true, anti, and output memory dependence violations. We observe empirically that loads and stores that violate anti and output memory dependences are rarely on a program's critical path and that the additional cost of enforcing predicted anti and output dependences among these loads and stores is minimal. In conjunction with a scheduler that enforces predicted anti and output dependences, the MDT and SFC yield performance equivalent to that of a large LSQ that has similar or greater circuit complexity. The SFC and MDT are scalable structures that yield high performance and lower dynamic power consumption than the LSQ, and they are well-suited for checkpointed processors with large instruction windows.
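
The interaction between the two structures can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the design evaluated in the paper: the abstract does not give the entry layout, so the model assumes each MDT entry records the highest load and store sequence numbers seen at its index, that lower sequence numbers denote older instructions, and that the SFC keeps a single untagged-by-age slot per index. The names `ToyMemoryUnit`, `MDTEntry`, `execute_load`, and `execute_store` are hypothetical, and recovery after a detected violation is not modeled.

```python
from dataclasses import dataclass


@dataclass
class MDTEntry:
    # Assumed entry layout (not taken from the paper's text):
    max_load_seq: int = -1   # youngest load that has executed at this index
    max_store_seq: int = -1  # youngest store that has executed at this index


class ToyMemoryUnit:
    """Toy model of an address-indexed SFC paired with an address-indexed MDT."""

    def __init__(self, num_entries: int = 64):
        self.num_entries = num_entries
        # SFC: one slot per index holding (address, value); stores to the
        # same address overwrite each other because nothing is renamed.
        self.sfc = {}
        self.mdt = [MDTEntry() for _ in range(num_entries)]

    def _index(self, address: int) -> int:
        # Direct indexing by address, no associative search. Distinct
        # addresses that share an index are treated conservatively by the MDT.
        return address % self.num_entries

    def execute_store(self, seq: int, address: int, value: int) -> list:
        """Execute a store out of order; return the violations detected."""
        idx = self._index(address)
        entry = self.mdt[idx]
        violations = []
        if seq < entry.max_load_seq:
            # A younger load already executed and may have missed this value.
            violations.append("true")
        if seq < entry.max_store_seq:
            # A younger store already wrote; this write would clobber it.
            violations.append("output")
        entry.max_store_seq = max(entry.max_store_seq, seq)
        self.sfc[idx] = (address, value)
        return violations

    def execute_load(self, seq: int, address: int):
        """Execute a load out of order; return (forwarded value, violations)."""
        idx = self._index(address)
        entry = self.mdt[idx]
        violations = []
        if seq < entry.max_store_seq:
            # A younger store already overwrote the SFC slot, so the value
            # this older load would read may be wrong.
            violations.append("anti")
        entry.max_load_seq = max(entry.max_load_seq, seq)
        slot = self.sfc.get(idx)
        value = slot[1] if slot is not None and slot[0] == address else None
        return value, violations


unit = ToyMemoryUnit()
unit.execute_store(seq=5, address=0x40, value=7)            # younger store runs first
value, violations = unit.execute_load(seq=3, address=0x40)  # older load runs late
print(value, violations)
```

In this example the older load (sequence number 3) reads the value written by the younger store (sequence number 5), which is exactly the kind of incorrect forwarding the abstract attributes to the lack of renaming; the sequence-number comparison flags it as an anti dependence violation.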