Late-binding: enabling unordered load-store queues

Authors:
Simha Sethumadhavan;Franziska Roesner;Joel S. Emer;Doug Burger;Stephen W. Keckler
Affiliations:
The University of Texas at Austin, Austin, TX;The University of Texas at Austin, Austin, TX;Intel Corporation, Boston, MA;The University of Texas at Austin, Austin, TX;The University of Texas at Austin, Austin, TX
Venue:
Proceedings of the 34th annual international symposium on Computer architecture
Year:
2007

Abstract

Conventional load/store queues (LSQs) are an impediment to both power-efficient execution in superscalar processors and scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This approach enables lower occupancy and thus smaller LSQs. Efficient implementations of late-binding LSQs, however, require the entries in the LSQ to be unordered with respect to age. We show how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses. We show that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks. To handle the increased overflows, we apply classic network flow control techniques to the processor micronetworks, enabling low-overhead recovery mechanisms from bank overflows. We evaluate three such mechanisms: instruction replay, skid buffers, and virtual-channel buffering in the on-chip memory network. We show that for an 80-instruction window, the LSQ can be reduced to 32 entries. For a 1,024-instruction window, the unordered, late-binding LSQ works well with four banks of 48 entries each. By applying a Bloom filter as well, this design achieves full hardware memory disambiguation for a 1,024-instruction window while requiring low average power per load and store access of 8 and 12 CAM entries, respectively.
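
The core idea of the abstract, an LSQ whose entries are allocated at issue and are unordered with respect to age, can be illustrated with a small software model. The following is a minimal sketch under stated assumptions, not the hardware design evaluated in the paper: the class and method names (`UnorderedLSQ`, `issue_load`, `issue_store`, `commit`), the explicit `age` tag, and the 64-byte interleaving granularity in `bank_of` are all hypothetical choices made for illustration. The violation check is deliberately conservative, and the Bloom filter and overflow-recovery mechanisms are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    age: int                      # program-order sequence number (assumed age tag)
    is_store: bool
    addr: int
    value: Optional[int] = None   # data for stores; unused for loads


def bank_of(addr: int, num_banks: int, line_bytes: int = 64) -> int:
    """Pick a bank by address interleaving (granularity is an assumption)."""
    return (addr // line_bytes) % num_banks


class UnorderedLSQ:
    """Toy model: entries are bound at issue and may sit in any free slot."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Optional[Entry]] = [None] * capacity

    def _allocate(self, entry: Entry) -> bool:
        # Late binding: take any free slot; slot position carries no age order.
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = entry
                return True
        return False  # overflow: the caller must recover (e.g. replay)

    def issue_load(self, age: int, addr: int) -> Tuple[bool, Optional[int]]:
        """Return (allocated, forwarded value or None if no older store matches)."""
        if not self._allocate(Entry(age, False, addr)):
            return False, None
        older = [e for e in self.slots
                 if e is not None and e.is_store and e.addr == addr and e.age < age]
        if not older:
            return True, None
        # Forward from the youngest store that is still older than the load.
        return True, max(older, key=lambda e: e.age).value

    def issue_store(self, age: int, addr: int, value: int) -> Tuple[bool, List[int]]:
        """Return (allocated, ages of younger loads to the same address).

        Conservative: every younger, already-issued load to this address is
        reported, even if it was correctly forwarded from an intervening store.
        """
        if not self._allocate(Entry(age, True, addr, value)):
            return False, []
        violations = sorted(e.age for e in self.slots
                            if e is not None and not e.is_store
                            and e.addr == addr and e.age > age)
        return True, violations

    def commit(self, age: int) -> None:
        # Free the entry with this age tag, wherever it happens to be stored.
        for i, e in enumerate(self.slots):
            if e is not None and e.age == age:
                self.slots[i] = None
```

Because slot position no longer encodes program order, every search compares explicit age tags. In this sketch that is the additional complexity an unordered design takes on in exchange for allocating entries only when instructions issue.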