Scalable Hardware Memory Disambiguation for High ILP Processors

Authors:
Simha Sethumadhavan;Rajagopalan Desikan;Doug Burger;Charles R. Moore;Stephen W. Keckler
Affiliations:
Computer Architecture and Technology Laboratory, Department of Computer Sciences;Department of Electrical and Computer Engineering, The University of Texas at Austin;Computer Architecture and Technology Laboratory, Department of Computer Sciences;Computer Architecture and Technology Laboratory, Department of Computer Sciences;Computer Architecture and Technology Laboratory, Department of Computer Sciences
Venue:
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2003



Abstract

This paper describes several methods for improving the scalability of memory disambiguation hardware for future high ILP processors. As the number of in-flight instructions grows with issue width and pipeline depth, the load/store queues (LSQ) threaten to become a bottleneck in both power and latency. By employing lightweight approximate hashing in hardware with structures called Bloom filters, many improvements to the LSQ are possible.

We propose two types of filtering schemes using Bloom filters: search filtering, which uses hashing to reduce both the number of lookups to the LSQ and the number of entries that must be searched, and state filtering, in which the number of entries kept in the LSQs is reduced by coupling address predictors and Bloom filters, permitting smaller queues. We evaluate these techniques for LSQs indexed by both instruction age and the instruction's effective address, and for both centralized and physically partitioned LSQs. We show that search filtering avoids up to 98% of the associative LSQ searches, providing significant power savings and keeping LSQ searches to under one high-frequency clock cycle. We also show that with state filtering, the load queue can be eliminated altogether with only minor reductions in performance for small instruction window machines.
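
The search-filtering idea summarized in the abstract can be illustrated with a small software model. The sketch below is not the paper's hardware design: the class names, the counter-array size, the two hash functions, and the restriction to loads searching a store queue are all assumptions chosen for illustration. It shows only the general principle that a Bloom filter never reports a false negative, so a load whose address misses in the filter can skip the associative search.

```python
class CountingBloomFilter:
    """Counting Bloom filter over addresses (sizes and hashes are hypothetical)."""

    def __init__(self, num_counters=64, num_hashes=2):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, addr):
        # Illustrative hashes: drop the low 3 bits, multiply by an odd
        # constant that differs per hash, shift, then reduce modulo the size.
        n = len(self.counters)
        word = addr >> 3
        return [((word * (0x9E3779B1 + 2 * i)) >> 7) % n
                for i in range(self.num_hashes)]

    def insert(self, addr):
        for i in self._indexes(addr):
            self.counters[i] += 1

    def remove(self, addr):
        for i in self._indexes(addr):
            self.counters[i] -= 1

    def may_contain(self, addr):
        # False means "definitely absent"; True means "possibly present".
        return all(self.counters[i] > 0 for i in self._indexes(addr))


class FilteredStoreQueue:
    """Store queue whose associative search is guarded by a Bloom filter."""

    def __init__(self):
        self.stores = []      # (age, addr) pairs for in-flight stores
        self.filter = CountingBloomFilter()
        self.searches = 0     # associative searches actually performed
        self.filtered = 0     # searches avoided by the filter

    def issue_store(self, age, addr):
        self.stores.append((age, addr))
        self.filter.insert(addr)

    def commit_store(self, age):
        for k, (a, addr) in enumerate(self.stores):
            if a == age:
                self.filter.remove(addr)
                del self.stores[k]
                return

    def load_lookup(self, age, addr):
        """Return the age of the youngest older store to addr, or None."""
        if not self.filter.may_contain(addr):
            self.filtered += 1    # no in-flight store can match: skip search
            return None
        self.searches += 1        # possible match: do the full search
        older = [a for a, s in self.stores if s == addr and a < age]
        return max(older) if older else None
```

In this model a filter hit can still be a false positive, in which case the full search runs and simply finds nothing, so the filter affects only how often the queue is searched and never the result of a lookup.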