Complexity-Effective Reorder Buffer Designs for Superscalar Processors

Authors:
Gurhan Kucuk;Dmitry V. Ponomarev;Oguz Ergin;Kanad Ghose
Venue:
IEEE Transactions on Computers
Year:
2004

Abstract

All contemporary dynamically scheduled processors support register renaming to cope with false data dependencies. One way to implement register renaming is to use the slots within the Reorder Buffer (ROB) as physical registers. In such designs, the ROB is a large multiported structure that occupies a significant portion of the die area and dissipates a sizable fraction of the total chip power. The heavily ported ROB is also likely to have a large delay that can limit the processor clock rate. We consider several approaches for reducing ROB complexity in processors that use the ROB slots to implement physical registers. The first approach exploits the fact that the bulk of source operand reads are satisfied through forwarding or by reading committed register values. This technique completely eliminates the read ports needed on the ROB for reading source operands. A small set of associatively addressed retention latches caches the most recently produced results to compensate for the resulting performance degradation. The second approach relies on a distributed implementation that spreads the centralized ROB structure across the function units (FUs), with each distributed component sized to match the FU workload and equipped with one write port and two read ports. The third approach combines retention latches with a distributed ROB implementation that uses minimally ported distributed components. The net result of combining the two techniques is a distributed ROB with minimal conflicts over the read ports and no conflicts over the write ports. Our designs are evaluated using simulation of SPEC 2000 benchmarks and measurements of actual ROB layouts in a 0.18-micron CMOS process.
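The first approach can be illustrated with a small behavioral sketch. This is not the authors' implementation: the class and function names, the latch capacity, the FIFO replacement policy, and the stall-on-miss behavior are all assumptions made for illustration. It only shows the idea stated in the abstract, that a source operand is obtained from forwarding, from a small associative set of retention latches, or from the committed register values, never from a ROB read port.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small associatively addressed buffer holding recently produced results.

    Assumption: FIFO replacement and a capacity of 8; the abstract specifies neither.
    """

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()  # ROB slot tag -> result value

    def write(self, rob_tag, value):
        # Re-inserting a tag refreshes its position; otherwise evict the oldest entry.
        if rob_tag in self.entries:
            del self.entries[rob_tag]
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[rob_tag] = value

    def contains(self, rob_tag):
        return rob_tag in self.entries

    def read(self, rob_tag):
        return self.entries[rob_tag]


def read_source_operand(rob_tag, forwarding_bus, latches, committed):
    """Return (source, value) for an operand without using any ROB read port.

    forwarding_bus and committed are plain dicts keyed by tag (hypothetical model).
    """
    if rob_tag in forwarding_bus:        # result is being forwarded this cycle
        return "forward", forwarding_bus[rob_tag]
    if latches.contains(rob_tag):        # recently produced, still in the latches
        return "latch", latches.read(rob_tag)
    if rob_tag in committed:             # value has already been committed
        return "committed", committed[rob_tag]
    return "stall", None                 # assumption: wait until the value commits


latches = RetentionLatches(capacity=2)
for tag, value in [(1, 10), (2, 20), (3, 30)]:
    latches.write(tag, value)            # tag 1 is evicted when tag 3 arrives
```

In this sketch, an operand whose producer has left the latches but has not yet committed is the case that costs performance; the retention latches exist to make that case rare.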