Instruction Replication for Reducing Delays Due to Inter-PE Communication Latency

Authors:
Aneesh Aggarwal;Manoj Franklin
Affiliations:
IEEE Computer Society;IEEE Computer Society
Venue:
IEEE Transactions on Computers
Year:
2005

Abstract

As feature sizes shrink, wire delays become increasingly critical. Clustering is a popular decentralization approach for reducing the impact of shrinking technologies on clock speed. In this approach, the centralized instruction window is replaced with multiple smaller windows, called clusters or processing elements (PEs). The performance of these clustered processors depends on the amount of inter-PE communication and load imbalance incurred by the algorithm used to distribute instructions among the PEs. In this paper, we investigate a novel approach to reducing the impact of inter-PE communication latency while preserving good load balance. The basic idea is to selectively replicate instructions in those PEs where their results are required. Replication is guided by heuristics that weigh its potential benefits. We found that, with instruction replication, the IPC of a clustered processor is significantly higher than that obtained without instruction replication and is within just 8 percent of that of a superscalar configuration with a centralized instruction scheduler.
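
The replication idea can be illustrated with a small toy model. The sketch below is not the paper's heuristic, which the abstract does not specify. It assumes a hypothetical rule: replicate a producer into a consumer's PE only if inter-PE communication has a nonzero latency, the producer's own operands are already available in that PE, and the PE still has room (a crude stand-in for load balance). The names `Instr`, `plan_replication`, `comm_latency`, and `pe_capacity` are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    pe: int              # PE the instruction was originally assigned to
    sources: tuple = ()  # names of the instructions that produce its operands


def plan_replication(instrs, comm_latency, pe_capacity):
    """Return a set of (producer_name, consumer_pe) pairs to replicate.

    Hypothetical rule, assumed for illustration only: replicate when a
    remote operand would otherwise pay comm_latency cycles, the replica's
    operands are local to the consumer PE, and the PE is below capacity.
    """
    by_name = {i.name: i for i in instrs}

    # Current number of instructions held by each PE.
    load = {}
    for i in instrs:
        load[i.pe] = load.get(i.pe, 0) + 1

    replicas = set()
    for consumer in instrs:
        for src in consumer.sources:
            producer = by_name[src]
            if producer.pe == consumer.pe or (src, consumer.pe) in replicas:
                continue  # operand is already produced locally

            # A replica only helps if it can itself execute without
            # waiting on a remote value.
            operands_local = all(
                by_name[s].pe == consumer.pe or (s, consumer.pe) in replicas
                for s in producer.sources
            )
            has_room = load[consumer.pe] < pe_capacity

            if comm_latency > 0 and operands_local and has_room:
                replicas.add((src, consumer.pe))
                load[consumer.pe] += 1
    return replicas


# Example: "a" runs in PE 0, while its consumer "b" and "c" run in PE 1.
program = [Instr("a", 0), Instr("b", 1, ("a",)), Instr("c", 1, ("b",))]
print(plan_replication(program, comm_latency=2, pe_capacity=4))
```

In this example the sketch chooses to replicate `a` in PE 1, so `b` no longer waits on an inter-PE transfer. Lowering `pe_capacity` to 2 suppresses the replica, which mirrors the trade-off between communication latency and load balance that the abstract describes.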