Reducing Design Complexity of the Load/Store Queue

Authors:
Il Park; Chong Liang Ooi; T. N. Vijaykumar
Affiliations:
School of Electrical and Computer Engineering, Purdue University; School of Electrical and Computer Engineering, Purdue University; School of Electrical and Computer Engineering, Purdue University
Venue:
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2003

Abstract

With faster CPU clocks and wider pipelines, all relevant microarchitecture components should scale accordingly. There have been many proposals for scaling the issue queue, register file, and cache hierarchy. However, nothing has been done for scaling the load/store queue, despite the increasing pressure on the load/store queue in terms of capacity and search bandwidth. The load/store queue is a CAM structure which holds in-flight memory instructions and supports simultaneous searches to honor memory dependencies and memory consistency models. Therefore, it is difficult to scale the load/store queue.

In this study, we introduce novel techniques to scale the load/store queue. We propose two techniques, store-load pair predictor and load buffer, to reduce the search bandwidth requirement; and one technique, segmentation, to scale the size. We show that a load/store queue using our predictor and load buffer needs only one port to outperform a conventional two-ported load/store queue. Compared to the same base case, segmentation alone achieves speedups of 5% for integer benchmarks and 19% for floating-point benchmarks. A one-ported load/store queue using all of our techniques improves performance on average by 6% and 23%, and up to 15% and 59%, for integer and floating-point benchmarks, respectively, over a two-ported conventional load/store queue.
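
To make concrete why searching the load/store queue is costly, the following is a minimal software sketch of the associative search a conventional store queue performs when a load executes. It is an illustration only, not the design proposed in the paper: it does not model the store-load pair predictor, the load buffer, or segmentation. The names `StoreEntry`, `StoreQueue`, `insert`, and `search` are hypothetical, and the sketch assumes every access is to a whole, aligned word, so an address match is a simple equality test.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    seq: int    # program-order sequence number (smaller means older); assumed
    addr: int   # word address written by the store
    value: int  # data the store will write


class StoreQueue:
    """Toy model of the store half of a conventional load/store queue."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[StoreEntry] = []

    def insert(self, seq: int, addr: int, value: int) -> bool:
        # A full queue rejects the store; real hardware would stall instead.
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(StoreEntry(seq, addr, value))
        return True

    def search(self, load_seq: int, addr: int) -> Optional[int]:
        """Return the value of the youngest store that is older than the
        load and writes the same address, or None if there is no match."""
        best: Optional[StoreEntry] = None
        # Hardware compares every entry in parallel (the CAM search);
        # this loop visits the same entries one at a time.
        for entry in self.entries:
            if entry.addr == addr and entry.seq < load_seq:
                if best is None or entry.seq > best.seq:
                    best = entry
        return best.value if best is not None else None
```

In this model each executing load triggers one full search over all entries, which is why both the number of entries (capacity) and the number of loads searching per cycle (search bandwidth, i.e. ports) put pressure on the structure.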