Complexity Effective Bypass Networks

Authors:
Aneesh Aggarwal
Affiliations:
Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902
Venue:
Transactions on High-Performance Embedded Architectures and Compilers II
Year:
2009

Citing 21
Cited 0

The performance impact of incomplete bypassing in processor pipelines

Proceedings of the 28th annual international symposium on Microarchitecture
The M-Machine multicomputer

Proceedings of the 28th annual international symposium on Microarchitecture
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Power considerations in the design of the Alpha 21264 microprocessor

DAC '98 Proceedings of the 35th annual Design Automation Conference
Reducing power in high-performance microprocessors

DAC '98 Proceedings of the 35th annual Design Automation Conference
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
On pipelining dynamic instruction scheduling logic

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Increasing processor performance by implementing deeper pipelines

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A design space evaluation of grid processor architectures

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Select-free instruction scheduling logic

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
Baring It All to Software: Raw Machines

Computer
Efficient Interconnects for Clustered Microarchitectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor

COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
Instruction Replication: Reducing Delays Due to Inter-PE Communication Latency

Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Routed Inter-ALU Networks for ILP Scalability and Performance

ICCD '03 Proceedings of the 21st International Conference on Computer Design
The engineering design of the stretch computer

IRE-AIEE-ACM '59 (Eastern) Papers presented at the December 1-3, 1959, eastern joint IRE-AIEE-ACM computer conference
Single FU bypass networks for high clock rate superscalar processors

HiPC'04 Proceedings of the 11th international conference on High Performance Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Superscalar processors depend heavily on broadcast-based bypass networks to improve performance by exploiting more instruction level parallelism. However, increasing clock speeds and shrinking technology make broadcasting slower and difficult to implement, especially for wide issue and deeply pipelined processors. High latency bypass networks delay the execution of dependent instructions, which could result in significant performance loss. In this paper, we first perform a detailed analysis of the performance impact due to delays in the execution of dependent instructions caused by high latency bypass networks. We found that the performance impact due to delayed data-dependent instruction execution varies based on the data dependence present in a program and on the type of instructions constituting the program code. We also found that the performance impact varies significantly with the hardware configuration, and that with a high latency bypass network, the processor hardware critical for near-maximal performance reduces considerably. We then propose Single FU bypass networks to reduce the bypass network latency, where results from an FU are forwarded only to itself. The new bypass network design is based on the observations that an instruction's result is mostly required by just one other instruction and that the operands of many instructions come from a single other instruction. The new bypass network results in significant reduction in the data forwarding latency, while incurring only a small impact (about 2% for most of the SPEC2K benchmarks) on the instructions per cycle (IPC) count. However, reduced bypass latency can potentially increase the clock speed. Single FU bypass networks are also much more scalable than the broadcast-based bypass networks, for more wide and more deeply pipelined future microprocessors.