The performance impact of incomplete bypassing in processor pipelines
Proceedings of the 28th annual international symposium on Microarchitecture
Proceedings of the 28th annual international symposium on Microarchitecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Power considerations in the design of the Alpha 21264 microprocessor
DAC '98 Proceedings of the 35th annual Design Automation Conference
Reducing power in high-performance microprocessors
DAC '98 Proceedings of the 35th annual Design Automation Conference
Clock rate versus IPC: the end of the road for conventional microarchitectures
Proceedings of the 27th annual international symposium on Computer architecture
On pipelining dynamic instruction scheduling logic
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Increasing processor performance by implementing deeper pipelines
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A design space evaluation of grid processor architectures
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Select-free instruction scheduling logic
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
Efficient Interconnects for Clustered Microarchitectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Instruction Replication: Reducing Delays Due to Inter-PE Communication Latency
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Routed Inter-ALU Networks for ILP Scalability and Performance
ICCD '03 Proceedings of the 21st International Conference on Computer Design
The engineering design of the stretch computer
IRE-AIEE-ACM '59 (Eastern) Papers presented at the December 1-3, 1959, eastern joint IRE-AIEE-ACM computer conference
Single FU bypass networks for high clock rate superscalar processors
HiPC'04 Proceedings of the 11th international conference on High Performance Computing
Hi-index | 0.00 |
Superscalar processors depend heavily on broadcast-based bypass networks to improve performance by exploiting more instruction level parallelism. However, increasing clock speeds and shrinking technology make broadcasting slower and difficult to implement, especially for wide issue and deeply pipelined processors. High latency bypass networks delay the execution of dependent instructions, which could result in significant performance loss. In this paper, we first perform a detailed analysis of the performance impact due to delays in the execution of dependent instructions caused by high latency bypass networks. We found that the performance impact due to delayed data-dependent instruction execution varies based on the data dependence present in a program and on the type of instructions constituting the program code. We also found that the performance impact varies significantly with the hardware configuration, and that with a high latency bypass network, the processor hardware critical for near-maximal performance reduces considerably. We then propose Single FU bypass networks to reduce the bypass network latency, where results from an FU are forwarded only to itself. The new bypass network design is based on the observations that an instruction's result is mostly required by just one other instruction and that the operands of many instructions come from a single other instruction. The new bypass network results in significant reduction in the data forwarding latency, while incurring only a small impact (about 2% for most of the SPEC2K benchmarks) on the instructions per cycle (IPC) count. However, reduced bypass latency can potentially increase the clock speed. Single FU bypass networks are also much more scalable than the broadcast-based bypass networks, for more wide and more deeply pipelined future microprocessors.