Dynamic speculation and synchronization of data dependences
Proceedings of the 24th annual international symposium on Computer architecture
Memory dependence prediction using store sets
Proceedings of the 25th annual international symposium on Computer architecture
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
A large, fast instruction window for tolerating cache misses
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A scalable instruction queue design using dependence chains
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Reducing register ports for higher speed and lower energy
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
POWER4 system microarchitecture
IBM Journal of Research and Development
A first glance at Kilo-instruction based multiprocessors
Proceedings of the 1st conference on Computing frontiers
Memory Ordering: A Value-Based Approach
Proceedings of the 31st annual international symposium on Computer architecture
Memory Ordering: A Value-Based Approach
IEEE Micro
Toward kilo-instruction processors
ACM Transactions on Architecture and Code Optimization (TACO)
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Scalable Load and Store Processing in Latency Tolerant Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization
Proceedings of the 32nd annual international symposium on Computer Architecture
Store Buffer Design in First-Level Multibanked Data Caches
Proceedings of the 32nd annual international symposium on Computer Architecture
Snug set-associative caches: reducing leakage power while improving performance
ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Load-Store Queue Management: an Energy-Efficient Design Based on a State-Filtering Mechanism.
ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Scalable Store-Load Forwarding via Store Queue Index Prediction
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Address-Indexed Memory Disambiguation and Store-to-Load Forwarding
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Store Memory-Level Parallelism Optimizations for Commercial Applications
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 33rd annual international symposium on Computer Architecture
Decomposing the load-store queue by function for power reduction and scalability
IBM Journal of Research and Development
Substituting associative load queue with simple hash tables in out-of-order microprocessors
Proceedings of the 2006 international symposium on Low power electronics and design
NoSQ: Store-Load Communication without a Store Queue
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
DMDC: Delayed Memory Dependence Checking through Age-Based Filtering
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scalable Cache Miss Handling for High Memory-Level Parallelism
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ACM Transactions on Architecture and Code Optimization (TACO)
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Power-efficient and scalable load/store queue design via address compression
Proceedings of the 2008 ACM symposium on Applied computing
A Two-Level Load/Store Queue Based on Execution Locality
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Using age registers for a simple load-store queue filtering
Journal of Systems Architecture: the EUROMICRO Journal
On reducing load/store latencies of cache accesses
Journal of Systems Architecture: the EUROMICRO Journal
Exploiting execution locality with a decoupled Kilo-instruction processor
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
An Efficient Memory Organization for High-ILP Inner Modem Baseband SDR Processors
Journal of Signal Processing Systems
SAMIE-LSQ: set-associative multiple-instruction entry load/store queue
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
L1 data cache power reduction using a forwarding predictor
PATMOS'10 Proceedings of the 20th international conference on Integrated circuit and system design: power and timing modeling, optimization and simulation
A power-efficient and scalable load-store queue design
PATMOS'05 Proceedings of the 15th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
Implicit transactional memory in kilo-instruction multiprocessors
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Hi-index | 0.00 |
With faster CPU clocks and wider pipelines, all relevantmicroarchitecture components should scale accordingly.There have been many proposals for scaling the issue queue,register file, and cache hierarchy. However, nothing has beendone for scaling the load/store queue, despite the increasingpressure on the load/store queue in terms of capacity andsearch bandwidth. The load/store queue is a CAM structurewhich holds in-flight memory instructions and supportssimultaneous searches to honor memory dependencies andmemory consistency models. Therefore, it is difficult to scalethe load/store queue.In this study, we introduce novel techniques to scale theload/store queue. We propose two techniques, store-loadpair predictor and load buffer, to reduce the search bandwidthrequirement; and one technique, segmentation, toscale the size. We show that a load/store queue using ourpredictor and load buffer needs only one port to outperforma conventional two-ported load/store queue. Compared tothe same base case, segmentation alone achieves speedupsof 5% for integer benchmarks and 19% for floating pointbenchmarks. A one-ported load/store queue using all of ourtechniques improves performance on average by 6% and23%, and up to 15% and 59%, for integer and floating-pointbenchmarks, respectively, over a two-ported conventionalload/store queue.