As transistor sizes continue to shrink, processor designers face new difficulties in achieving high clock frequencies. The register file read time, the wake-up and selection logic traversal delay, and the bypass network transit delay, together with their respective power consumptions, are major obstacles in the design of wide-issue superscalar processors.

In this paper, we show that transgressing a rule that has so far been applied in the design of all superscalar processors reduces these difficulties. Currently used general-purpose ISAs feature a single logical register file (and generally a floating-point register file). Up to now, all superscalar processors have allowed any general-purpose functional unit to read and write any physical general-purpose register.

First, we propose Register Write Specialization, i.e., forcing distinct groups of functional units to write only to distinct subsets of the physical register file, thus limiting the number of write ports on each individual register. Register Write Specialization significantly reduces the access time, the power consumption, and the silicon area of the register file without impairing performance.

Second, we propose combining Register Write Specialization with Register Read Specialization (WSRS) for clustered superscalar processors. This limits the number of read ports on each individual register and simplifies both the wake-up logic and the bypass network. With an 8-way 4-cluster WSRS architecture, the complexities of the wake-up logic entry and of the bypass point are equivalent to those of a conventional 4-way issue processor. WSRS architectures need more physical registers. Nevertheless, a WSRS architecture dramatically reduces the total silicon area devoted to the physical register file (by a factor of four to six), more than halves its power consumption, and shortens its read access time by one third.

Some extra hardware and/or a few extra pipeline stages are needed for register renaming, and the WSRS architecture constrains the policy for allocating instructions to clusters. However, the performance of an 8-way 4-cluster WSRS architecture remains comparable to that of a conventional 8-way 4-cluster superscalar processor.
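As a rough illustration of the write-port argument behind Register Write Specialization, the following Python sketch counts write ports per physical register. It is not the authors' model: it assumes each functional unit produces at most one result per cycle, that units are split into equally sized groups, and that the register file is split into equally sized contiguous subsets. The function names and the 128-register file size are hypothetical.

```python
def write_ports_per_register(num_units: int, num_groups: int = 1) -> int:
    """Write ports each register needs when only one group of units may write it.

    Assumes one result per unit per cycle and equally sized groups.
    num_groups=1 models the conventional case (any unit writes any register).
    """
    if num_groups <= 0 or num_units % num_groups != 0:
        raise ValueError("units must split evenly into groups")
    return num_units // num_groups


def can_write(unit: int, reg: int, num_units: int, num_regs: int, num_groups: int) -> bool:
    """True if `unit` may write physical register `reg` under write specialization.

    Assumes unit group g owns the g-th contiguous subset of the register file.
    """
    unit_group = unit // (num_units // num_groups)
    reg_subset = reg // (num_regs // num_groups)
    return unit_group == reg_subset


if __name__ == "__main__":
    units, regs, groups = 8, 128, 4  # hypothetical 8-way, 4-group configuration
    print("conventional write ports:", write_ports_per_register(units))
    print("specialized write ports: ", write_ports_per_register(units, groups))
    print("unit 0 may write r0:     ", can_write(0, 0, units, regs, groups))
    print("unit 7 may write r0:     ", can_write(7, 0, units, regs, groups))
```

Under these assumptions, an 8-unit machine split into 4 groups needs 2 write ports per register instead of 8. The sketch says nothing about read ports, renaming, or cluster allocation, which the abstract treats separately.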