Register write specialization register read specialization: a path to complexity-effective wide-issue superscalar processors

Authors:
André Seznec;Eric Toullec;Olivier Rochecouste
Affiliations:
IRISA/INRIA, Rennes, France;IRISA/INRIA, Rennes, France;IRISA/INRIA, Rennes, France
Venue:
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Year:
2002

Citing 15
Cited 16

A VLIW architecture for a trace scheduling compiler

ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Pipeline gating: speculation control for energy reduction

Proceedings of the 25th annual international symposium on Computer architecture
The energy complexity of register files

ISLPED '98 Proceedings of the 1998 international symposium on Low power electronics and design
Lx: a technology platform for customizable VLIW embedded processing

Proceedings of the 27th annual international symposium on Computer architecture
Multiple-banked register file architectures

Proceedings of the 27th annual international symposium on Computer architecture
On pipelining dynamic instruction scheduling logic

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Energy-effective issue logic

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Efficient dynamic scheduling through tag elimination

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Design tradeoffs for the Alpha EV8 conditional branch predictor

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Select-free instruction scheduling logic

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Reducing the complexity of the register file in dynamic superscalar processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
The Alpha 21264 Microprocessor

IEEE Micro
A three dimensional register file for superscalar processors

HICSS '95 Proceedings of the 28th Hawaii International Conference on System Sciences

Banked multiported register files for high-frequency superscalar microprocessors

Proceedings of the 30th annual international symposium on Computer architecture
Cyclone: a broadcast-free dynamic instruction scheduler with selective replay

Proceedings of the 30th annual international symposium on Computer architecture
A Content Aware Integer Register File Organization

Proceedings of the 31st annual international symposium on Computer architecture
Inherently Workload-Balanced Clustered Microarchitecture

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
A Speculative Control Scheme for an Energy-Efficient Banked Register File

IEEE Transactions on Computers
An asymmetric clustered processor based on value content

Proceedings of the 19th annual international conference on Supercomputing
A case for a complexity-effective, width-partitioned microarchitecture

ACM Transactions on Architecture and Code Optimization (TACO)
Register port complexity reduction in wide-issue processors with selective instruction execution

Microprocessors & Microsystems
Future ILP processors

International Journal of High Performance Computing and Networking
Improving performance and reducing energy-delay with adaptive resource resizing for out-of-order embedded processors

Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Achieving Out-of-Order Performance with Almost In-Order Complexity

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Simultaneous resource binding and interconnection optimization based on a distributed register-file microarchitecture

ACM Transactions on Design Automation of Electronic Systems (TODAES)
A Multi-Shared Register File Structure for VLIW Processors

Journal of Signal Processing Systems
Decoupled state-execute architecture

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
CRAM: coded registers for amplified multiporting

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
An optimized front-end physical register file with banking and writeback filtering

PACS'04 Proceedings of the 4th international conference on Power-Aware Computer Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

With the continuous shrinking of transistor size, processor designers are facing new difficulties to achieve high clock frequency. The register file read time, the wake up and selection logic traversal delay and the bypass network transit delay with also their respective power consumptions constitute major difficulties for the design of wide issue superscalar processors.In this paper, we show that transgressing a rule, that has so far been applied in the design of all the superscalar processors, allows to reduce these difficulties. Currently used general-purpose ISAs feature a single logical register file (and generally a floating-point register file). Up to now all superscalar processors have allowed any general-purpose functional unit to read and write any physical general purpose register.First, we propose Register Write Specialization, i.e, forcing distinct groups of functional units to write only in distinct subsets of the physical register file, thus limiting the number of write ports on each individual register. Register Write Specialization significantly reduces the access time, the power consumption and the silicon area of the register file without impairing performance.Second, we propose to combine Register Write Specialization with Register Read Specialization for clustered superscalar processors. This limits the number of read ports on each individual register and simplifies both the wakeup logic and the bypass network. With a 8-way 4-cluster WSRS architecture, the complexities of the wake-up logic entry and bypass point are equivalent to the ones found with a conventional 4-way issue processor. More physical registers are needed in WSRS architectures. Nevertheless, using WSRS architecture allows a dramatic reduction of the total silicon area devoted to the physical register file (by a factor four to six). Its power consumption is more than halved and its read access time is shortened by one third. Some extra hardware and/or a few extra pipeline stages are needed for register renaming. WSRS architecture induces constraints on the policy for allocating instructions to clusters. However, performance of a 8-way 4-cluster WSRS architecture stands the comparison with the one of a conventional 8-way 4-cluster conventional superscalar processor.