Highly concurrent scalar processing
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Code duplication: an assist for global instruction scheduling
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
MOVE: a framework for high-performance processor design
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Register allocation with instruction scheduling
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
The superblock: an effective technique for VLIW and superscalar compilation
The Journal of Supercomputing - Special issue on instruction-level parallelism
Transport-Triggering versus Operation-Triggering
CC '94 Proceedings of the 5th International Conference on Compiler Construction
Exploiting short-lived variables in superscalar processors
Proceedings of the 28th annual international symposium on Microarchitecture
Partitioned register file for TTAs
Proceedings of the 28th annual international symposium on Microarchitecture
ShiftQ: a bufferred interconnect for custom loop accelerators
CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Proceedings of the 30th annual international symposium on Computer architecture
Impact of Software Bypassing on Instruction Level Parallelism and Register File Traffic
SAMOS '08 Proceedings of the 8th international workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Programmable and Scalable Architecture for Graphics Processing Units
SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Shared-port register file architecture for low-energy VLIW processors
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Exploitation of large amounts of instruction level parallelism requires a large amount of connectivity between the shared register file and the function units; this connectivity is expensive and increases the cycle time.This paper shows that the new class of transport triggered architectures requires fewer ports on the shared register file than traditional operation triggered architectures. This is achieved by programming data-transports instead of operations.Experiments with our extended basic block scheduler have shown that the reduction of the required number of register file ports is substantial. The average requirement for scalar applications is 0.50 read and 0.35 write ports per operation instead of 2 read and 1 write ports. Due to this reduction it is possible to execute 2 operations per cycle with a two-ported register file and 3.6 operations per cycle with a six-ported register file.