Highly concurrent scalar processing
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
The superblock: an effective technique for VLIW and superscalar compilation
The Journal of Supercomputing - Special issue on instruction-level parallelism
Partitioned register file for TTAs
Proceedings of the 28th annual international symposium on Microarchitecture
Simulation/evaluation environment for a VLIW processor architecture
IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a raw machine
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Lx: a technology platform for customizable VLIW embedded processing
Proceedings of the 27th annual international symposium on Computer architecture
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
High-quality operation binding for clustered VLIW datapaths
Proceedings of the 38th annual Design Automation Conference
Modulo scheduling with integrated register spilling for clustered VLIW architectures
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A Method for Register Allocation to Loops in Multiple Register File Architectures
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
A Register File Architecture and Compilation Scheme for Clustered ILP Processors
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
An Architectural Overview of the Programmable Multimedia Processor, TM-1
COMPCON '96 Proceedings of the 41st IEEE International Computer Conference
Treegion Scheduling for Wide Issue Processors
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Inter-Cluster Communication Models for Clustered VLIW Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
TriMedia CPU64 Application Domain and Benchmark Suite
ICCD '99 Proceedings of the 1999 IEEE International Conference on Computer Design
CARS: A New Code Generation Framework for Clustered ILP Processors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Implementing an experimental VLIW compiler
WCAE-3 '97 Proceedings of the 1997 workshop on Computer architecture education
Compiler-directed Data Partitioning for Multicluster Processors
Proceedings of the International Symposium on Code Generation and Optimization
Inter-cluster communication in VLIW architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Virtual Cluster Scheduling Through the Scheduling Graph
Proceedings of the International Symposium on Code Generation and Optimization
Variable Partitioning and Scheduling for MPSoC with Virtually Shared Scratch Pad Memory
Journal of Signal Processing Systems
SCRF: a hybrid register file architecture
PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Hi-index | 0.00 |
In this paper high-level language (HLL) variables that are alive in a whole HLL function, across multiple scheduling units, are termed as global values. Due to their long live ranges and, hence, large impact on the schedule, the global values require different compiler optimizations than local values, which span across only one scheduling unit. The instruction scheduler for a clustered ILP processor, which is responsible for cluster assignment of operations and variables, faces a difficult problem of assigning global values to clusters. Our study shows that trivial assignments (e.g. mapping all global values into one cluster) may result in a severe cycle count overhead relative to the unicluster of up to 26.3% for a four cluster VLIW machine. This paper presents three advanced algorithms for assigning global values to clusters based on multi-pass scheduling and affinity of variables. Furthermore, we measure performance of these algorithms on optimized multimedia C applications and assess quality of our algorithms by comparing them to a practical higher performance bound derived from a vast random search. Our algorithms reduce the execution time overhead of the best simple algorithm round-robin from 10.5% to 5.9% for the two cluster VLIW machine and from 17.3% to 14.12% for the four cluster VLIW machine.