This work presents a new compilation technique that uses instruction replication in order to reduce the number of communications executed on a clustered microarchitecture. For such architectures, the need to communicate values between clusters can result in a significant performance loss. Inter-cluster communications can be reduced by selectively replicating an appropriate set of instructions. However, instruction replication must be done carefully since it may also degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo scheduling algorithm that effectively reduces communications. Results show that the number of communications can decrease using replication, which results in significant speed-ups. IPC is increased by 25% on average for a 4-cluster microarchitecture and by as much as 70% for selected programs.
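The trade-off described above can be illustrated with a toy model. The following is a minimal sketch, not the scheme proposed in this work: the dependence graph, the instruction names (`inc`, `addr`, `use0`, `use1`), the cost model (one communication per producer and destination cluster), and the helper functions are all hypothetical and chosen only for illustration. It ignores modulo scheduling, latencies, and loop-carried dependences.

```python
# Toy model (assumption): a dependence edge needs one inter-cluster
# communication per (producer, destination cluster) pair whenever the
# consumer runs in a cluster that holds no copy of the producer.

def count_communications(edges, placement):
    """Return the number of (producer, destination cluster) copies needed.

    edges:     iterable of (producer, consumer) instruction names
    placement: dict mapping instruction name -> set of cluster ids
               in which an instance of that instruction executes
    """
    comms = set()
    for producer, consumer in edges:
        for cluster in placement[consumer]:
            if cluster not in placement[producer]:
                comms.add((producer, cluster))
    return len(comms)


def replicate(placement, instr, cluster):
    """Return a new placement with an extra copy of instr in cluster."""
    new_placement = {name: set(clusters) for name, clusters in placement.items()}
    new_placement[instr].add(cluster)
    return new_placement


def cluster_load(placement):
    """Return a dict cluster id -> number of instruction instances."""
    load = {}
    for clusters in placement.values():
        for cluster in clusters:
            load[cluster] = load.get(cluster, 0) + 1
    return load


# Hypothetical graph: inc -> addr -> {use0, use1}
edges = [("inc", "addr"), ("addr", "use0"), ("addr", "use1")]
base = {"inc": {0}, "addr": {0}, "use0": {0}, "use1": {1}}

# Replicating only addr moves the communication one level up the graph.
partial = replicate(base, "addr", 1)
# Replicating addr and its producer inc removes the communication.
full = replicate(partial, "inc", 1)

base_comms = count_communications(edges, base)        # 1
partial_comms = count_communications(edges, partial)  # 1
full_comms = count_communications(edges, full)        # 0
full_load = cluster_load(full)                        # {0: 3, 1: 3}
```

In this sketch, removing the single communication requires replicating the whole producer chain, and cluster 1 goes from one instruction to three. That extra load stands in for the resource contention the abstract warns about, which is why the set of replicated instructions has to be chosen selectively.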