Bulldog: a compiler for VLSI architectures
Bulldog: a compiler for VLSI architectures
Instruction scheduling in the TOBEY compiler
IBM Journal of Research and Development
A multilevel algorithm for partitioning graphs
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Effective cluster assignment for modulo scheduling
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a raw machine
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs
SIAM Journal on Scientific Computing
Instruction scheduling for clustered VLIW architectures
ISSS '00 Proceedings of the 13th international symposium on System synthesis
Cluster assignment for high-performance embedded VLIW processors
ACM Transactions on Design Automation of Electronic Systems (TODAES)
A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Efficient Interconnects for Clustered Microarchitectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Code Partitioning in Decoupled Compilers
Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Region-based hierarchical operation partitioning for multicluster processors
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Instruction Scheduling for Clustered VLIW DSPs
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Integrated temporal and spatial scheduling for extended operand clustered VLIW processors
Proceedings of the 1st conference on Computing frontiers
Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Optimal Superblock Scheduling Using Enumeration
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Data-Dependency Graph Transformations for Instruction Scheduling
Journal of Scheduling
Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors
IEEE Transactions on Parallel and Distributed Systems
Compiler-directed Data Partitioning for Multicluster Processors
Proceedings of the International Symposium on Code Generation and Optimization
A survey of research and practices of Network-on-chip
ACM Computing Surveys (CSUR)
Optimal integrated code generation for VLIW architectures: Research Articles
Concurrency and Computation: Practice & Experience - 10th International Workshop on Compilers for Parallel Computers (CPC 2003)
Data-Dependency Graph Transformations for Superblock Scheduling
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Handbook of Constraint Programming (Foundations of Artificial Intelligence)
Handbook of Constraint Programming (Foundations of Artificial Intelligence)
Inter-cluster communication in VLIW architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Pragmatic integrated scheduling for clustered VLIW architectures
Software—Practice & Experience
An Application of Constraint Programming to Superblock Instruction Scheduling
CP '08 Proceedings of the 14th international conference on Principles and Practice of Constraint Programming
Integrated Modulo Scheduling for Clustered VLIW Architectures
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
AGAMOS: A Graph-Based Approach to Modulo Scheduling for Clustered Microarchitectures
IEEE Transactions on Computers
Learning Heuristics for the Superblock Instruction Scheduling Problem
IEEE Transactions on Knowledge and Data Engineering
FCCM: A Novel Inter-Core Communication Mechanism in Multi-Core Platform
ICISE '09 Proceedings of the 2009 First IEEE International Conference on Information Science and Engineering
A Constraint Programming Approach for Instruction Assignment
INTERACT '11 Proceedings of the 2011 15th Workshop on Interaction between Compilers and Computer Architectures
Optimal integrated VLIW code generation with integer linear programming
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Hi-index | 0.00 |
Many embedded processors use clustering to scale up instruction-level parallelism in a cost-effective manner. In a clustered architecture, the registers and functional units are partitioned into smaller units and clusters communicate through register-to-register copy operations. Texas Instruments, for example, has a series of architectures for embedded processors which are clustered. Such an architecture places a heavier burden on the compiler, which must now assign instructions to clusters (spatial scheduling), assign instructions to cycles (temporal scheduling), and schedule copy operations to move data between clusters. We consider instruction scheduling of local blocks of code on clustered architectures to improve performance. Scheduling for space and time is known to be a hard problem. Previous work has proposed greedy approaches based on list scheduling to simultaneously perform spatial and temporal scheduling and phased approaches based on first partitioning a block of code to do spatial assignment and then performing temporal scheduling. Greedy approaches risk making mistakes that are then costly to recover from, and partitioning approaches suffer from the well-known phase ordering problem. In this article, we present a constraint programming approach for scheduling instructions on clustered architectures. We employ a problem decomposition technique that solves spatial and temporal scheduling in an integrated manner. We analyze the effect of different hardware parameters—such as the number of clusters, issue-width, and intercluster communication cost—on application performance. We found that our approach was able to achieve an improvement of up to 26%, on average, over a state-of-the-art technique on superblocks from SPEC 2000 benchmarks.