Complexity Effective Bypass Networks
Transactions on High-Performance Embedded Architectures and Compilers II
Microprocessors depend heavily on broadcast-based bypass networks to eliminate pipeline hazards arising from data dependencies. However, even though bypassing is logically simple, increasing clock speeds make broadcasting slower and more difficult to implement, especially for wide-issue and deeply pipelined processors. The problem is exacerbated by shrinking feature sizes, as wire delays become more important than gate delays. In this paper, we propose Single FU bypass networks for high-clock-rate superscalar processors in which, instead of a fully connected broadcast-based bypass network, results from a functional unit (FU) are forwarded only to that same FU. The new bypass network design is based on two observations: a result produced by an instruction is mostly required by just one other instruction, and the operands of many instructions come from a single other instruction. The new bypass network significantly reduces data forwarding latency while incurring only a small impact (about 2% for most of the SPEC2K benchmarks) on the instructions per cycle (IPC) count. However, the reduced bypass latency has high potential for increased clock speeds. Single FU bypass networks are also much more scalable than broadcast-based bypass networks for wider and more deeply pipelined future microprocessors.
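The first observation in the abstract, that a result is mostly required by just one other instruction, can be illustrated by counting how many later instructions read each produced value. The following is only a minimal sketch under stated assumptions: the trace format, the register names, and the helper functions consumer_counts and fanout_histogram are hypothetical and are not taken from the paper, which does not describe its measurement tooling here.

```python
from collections import Counter

# Hypothetical toy trace (an assumption for illustration only):
# each entry is (destination register, list of source registers).
TRACE = [
    ("r1", []),
    ("r2", ["r1"]),
    ("r3", ["r2"]),
    ("r4", ["r2", "r3"]),
    ("r1", ["r4"]),
    ("r5", ["r1"]),
]


def consumer_counts(trace):
    """For each dynamic instruction, count the later instructions that read its result."""
    last_writer = {}            # register name -> index of its most recent producer
    counts = [0] * len(trace)
    for idx, (dest, srcs) in enumerate(trace):
        # An instruction reading the same register twice still counts as one consumer.
        for src in set(srcs):
            producer = last_writer.get(src)
            if producer is not None:
                counts[producer] += 1
        # Record the new producer only after the sources have been read.
        last_writer[dest] = idx
    return counts


def fanout_histogram(trace):
    """Map 'number of consumers' to 'number of results with that many consumers'."""
    return Counter(consumer_counts(trace))


if __name__ == "__main__":
    for consumers, results in sorted(fanout_histogram(TRACE).items()):
        print(f"{results} result(s) with {consumers} consumer(s)")
```

In this toy trace most results have exactly one consumer, which is the kind of distribution that would make forwarding a result only to the producing FU attractive. Real measurements would need a trace from an actual benchmark run rather than a hand-written list.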