Loop Transformations for Architectures with Partitioned Register Banks

Authors:
Xianglong Huang; Steve Carr; Philip Sweany
Affiliations:
University of Massachusetts-Amherst, MA; Michigan Technological University, Houghton, MI; Texas Instruments, Dallas, TX
Venue:
OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Year:
2001

Abstract

Embedded systems require maximum performance from a processor within significant constraints on power consumption and chip cost. Using software pipelining, processors can often exploit considerable instruction-level parallelism (ILP), and thus significantly improve performance, at the cost of substantially increased register requirements. These increased register requirements, however, make it difficult to build a high-performance embedded processor with a single, multi-ported register file while maintaining clock speed and limiting power consumption.

Some digital signal processors, such as the TI C6x, reduce the number of ports required for a register bank by partitioning the register bank into multiple banks. Each disjoint subset of functional units is directly connected to one of the partitioned register banks. Each register bank, together with its associated functional units, is called a cluster. Clustering reduces the number of ports needed on a per-bank basis, allowing an increased clock rate. However, execution speed can be hampered by the potential need to copy "non-local" operands among register banks in order to make them available to the functional unit performing an operation. The task of the compiler is both to maximize parallelism and to minimize the number of remote register accesses needed.

Previous work has concentrated on methods to partition virtual registers among the target architecture's clusters. In this paper, we show how high-level loop transformations can enhance the partitioning obtained by low-level schemes. In our experiments, loop transformations improved software pipelining by 27% on a machine with 2 clusters, each having 1 floating-point and 1 integer register bank and 4 functional units. We also observed a 20% improvement on a similar machine with 4 clusters of 2 functional units. In fact, by performing the described loop transformations, we were able to show improvements of more than 10% over schedules for untransformed loops generated under the unrealistic assumption of a single multi-ported register bank.
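To make the copy problem concrete, the following is a minimal sketch of how inter-cluster copies might be counted for a given assignment of operations to clusters, and of why a loop transformation can help. It is not the paper's algorithm: the loop body `a[i] = b[i] * c[i] + d[i]`, the operation names, the choice of unrolling as the transformation, and the helper functions `count_copies` and `unroll` are all hypothetical and chosen only for illustration. The sketch assumes the loop has no loop-carried dependences and that a value needs one copy per remote cluster that reads it.

```python
from typing import Dict, List, Tuple

# (producer, consumer): the consumer reads the value the producer defines.
Edge = Tuple[str, str]


def count_copies(edges: List[Edge], cluster_of: Dict[str, int]) -> int:
    """Count register copies implied by a cluster assignment.

    Assumption: a value is copied once into each remote cluster that reads it.
    """
    remote_reads = {
        (src, cluster_of[dst])
        for src, dst in edges
        if cluster_of[src] != cluster_of[dst]
    }
    return len(remote_reads)


def unroll(edges: List[Edge], factor: int) -> List[Edge]:
    """Replicate a loop body `factor` times.

    Assumption: no loop-carried dependences, so the copies are independent.
    """
    return [
        (f"{src}_{k}", f"{dst}_{k}")
        for k in range(factor)
        for src, dst in edges
    ]


# Hypothetical dependence graph for: a[i] = b[i] * c[i] + d[i]
body: List[Edge] = [
    ("load_b", "mul"),
    ("load_c", "mul"),
    ("mul", "add"),
    ("load_d", "add"),
    ("add", "store_a"),
]

# One iteration spread over two clusters to use all functional units:
# loads on cluster 0, arithmetic and store on cluster 1.
split = {"load_b": 0, "load_c": 0, "load_d": 0,
         "mul": 1, "add": 1, "store_a": 1}

# Two unrolled iterations, each kept entirely on its own cluster.
unrolled = unroll(body, 2)
per_copy = {name: int(name.rsplit("_", 1)[1])
            for edge in unrolled for name in edge}

print(count_copies(body, split))         # copies for the split single iteration
print(count_copies(unrolled, per_copy))  # copies when each iteration owns a cluster
```

In this toy model, splitting a single iteration forces every loaded value across the bank boundary, whereas the unrolled body keeps both clusters busy with no copies at all. Real loops with shared or loop-carried values would not partition this cleanly.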