Removing communications in clustered microarchitectures through instruction replication

Authors:
Alex Aletà;Josep M. Codina;Antonio González;David Kaeli
Affiliations:
UPC, Barcelona, Spain;UPC, Barcelona, Spain;Intel Labs Barcelona, UPC, Barcelona, Spain;Northeastern University, Boston, MA
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2004

Citing 17
Cited 6

Rematerialization

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Analysis of multilevel graph partitioning

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Effective cluster assignment for modulo scheduling

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Lx: a technology platform for customizable VLIW embedded processing

Proceedings of the 27th annual international symposium on Computer architecture
A comparative study of modulo scheduling techniques

ICS '02 Proceedings of the 16th international conference on Supercomputing
Graph-partitioning based instruction scheduling for clustered processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Modulo scheduling with integrated register spilling for clustered VLIW architectures

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
The TigerSHARC DSP Architecture

IEEE Micro
Grain Size Determination for Parallel Processing

IEEE Software
A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing

MICRO 14 Proceedings of the 14th annual workshop on Microprogramming
Distributed Modulo Scheduling

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures

ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Swing Modulo Scheduling: A Lifetime-Sensitive Approach

PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Instruction Replication: Reducing Delays Due to Inter-PE Communication Latency

Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Instruction Replication for Clustered Microarchitectures

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture

Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays

Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
The Instruction-Set Extension Problem: A Survey

ARC '08 Proceedings of the 4th international workshop on Reconfigurable Computing: Architectures, Tools and Applications
The Instruction-Set Extension Problem: A Survey

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Communication-free data alignment for arrays with exponential references in parallelizing compilers for scalable parallel systems

The Journal of Supercomputing
Compiler supports for VLIW DSP processors with SIMD intrinsics

Concurrency and Computation: Practice & Experience
SCRF: a hybrid register file architecture

PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies

Quantified Score

Hi-index	0.00

Visualization

Abstract

The need to communicate values between clusters can result in a significant performance loss for clustered microarchitectures. In this work, we describe an optimization technique that removes communications by selectively replicating an appropriate set of instructions. Instruction replication is done carefully because it might degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo-scheduling algorithm. Though this algorithm has been proved to be very effective at reducing communications, results show that the number of communications can be further decreased by around one-third through replication, which results in a significant speedup. IPC is increased by 25% on average for a four-cluster microarchitecture and by as much as 70% for selected programs. We also show that replicating appropriate sets of instructions is more effective than doubling the intercluster connection network bandwidth.