Instruction Replication for Clustered Microarchitectures

Authors:
Alex Aletà;Josep M. Codina;Antonio González;David Kaeli
Affiliations:
Dep. of Computer Architecture, UPC, Barcelona, Spain;Dep. of Computer Architecture, UPC, Barcelona, Spain;Dep. of Computer Architecture, UPC, Barcelona, Spain and Intel Barcelona Research Center, Intel Labs, UPC, Barcelona, Spain;Northeastern University, Boston, MA
Venue:
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2003

Abstract

This work presents a new compilation technique that uses instruction replication in order to reduce the number of communications executed on a clustered microarchitecture. For such architectures, the need to communicate values between clusters can result in a significant performance loss. Inter-cluster communications can be reduced by selectively replicating an appropriate set of instructions. However, instruction replication must be done carefully since it may also degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo scheduling algorithm that effectively reduces communications. Results show that the number of communications can decrease using replication, which results in significant speed-ups. IPC is increased by 25% on average for a 4-cluster microarchitecture and by as much as 70% for selected programs.
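
The trade-off the abstract describes can be illustrated with a toy model. The sketch below is not the paper's algorithm: the graph, the cluster assignment, and the names `count_communications` and `replicate` are assumptions made up for illustration. It counts a communication whenever a consumer runs in a cluster that holds no copy of its producer, and it shows that replicating an instruction removes a communication only if the instruction's own operands are already available in the target cluster.

```python
def count_communications(edges, placement):
    """Count (producer, destination cluster) pairs that need a value copied.

    edges: iterable of (producer, consumer) pairs in a dependence graph.
    placement: dict mapping each instruction to the set of clusters that
               execute a copy of it (hypothetical representation).
    """
    needed = set()
    for producer, consumer in edges:
        for cluster in placement[consumer]:
            # A consumer copy can read a local producer copy for free;
            # otherwise the value must be sent to this cluster once.
            if cluster not in placement[producer]:
                needed.add((producer, cluster))
    return len(needed)


def replicate(placement, instruction, cluster):
    """Return a new placement with an extra copy of `instruction`."""
    updated = {name: set(clusters) for name, clusters in placement.items()}
    updated[instruction].add(cluster)
    return updated


# Case 1: `a` has no operands, so replicating it removes the communication.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")]
placement = {"a": {0}, "b": {0}, "d": {0}, "c": {1}, "e": {1}}
before = count_communications(edges, placement)
after = count_communications(edges, replicate(placement, "a", 1))

# Case 2: `a` depends on `p`, so replicating `a` alone only moves the
# communication one step up the chain; `p` must be replicated as well.
edges_chain = [("p", "a")] + edges
placement_chain = dict(placement, p={0})
only_a = replicate(placement_chain, "a", 1)
a_and_p = replicate(only_a, "p", 1)
chain_before = count_communications(edges_chain, placement_chain)
chain_only_a = count_communications(edges_chain, only_a)
chain_a_and_p = count_communications(edges_chain, a_and_p)

print(before, after)                              # 1 0
print(chain_before, chain_only_a, chain_a_and_p)  # 1 1 0
```

In the second case the communication disappears only after two extra instructions are issued in cluster 1, which is the kind of added resource pressure the abstract warns about. A real scheduler would weigh that cost against the latency saved.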