Combining Loop Transformations Considering Caches and Scheduling

Authors:
Michael E. Wolf;Dror E. Maydan;Ding-Kai Chen
Affiliations:
-;-;-
Venue:
International Journal of Parallel Programming
Year:
1998

Citing 12
Cited 1

Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
Estimating interlock and improving balance for pipelined architectures

Journal of Parallel and Distributed Computing
Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Efficient and exact data dependence analysis

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Software pipelining showdown: optimal vs. heuristic methods in a production compiler

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
Combining Optimization for Cache and Instruction-Level Parallelism

PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques

Iterative compilation

Embedded processor design challenges

Quantified Score

Hi-index	0.00

Visualization

Abstract

The performance of modern microprocessors is greatly affected by cache behavior, instruction scheduling, register allocation and loop overhead. High-level loop transformations such as fission, fusion, tiling, interchanging and outer loop unrolling (e.g., unroll and jam) are well known to be capable of improving all these aspects of performance. Difficulties arise because these machine characteristics and these optimizations are highly interdependent. Interchanging two loops might, for example, improve cache behavior but make it impossible to allocate registers in the inner loop. Similarly, unrolling or interchanging a loop might individually hurt performance but doing both simultaneously might help performance. Little work has been published on how to combine these transformations into an efficient and effective compiler algorithm. In this paper, we present a model that estimates total machine cycle time taking into account cache misses, software pipelining, register pressure and loop overhead. We then develop an algorithm to intelligently search through the various, possible transformations, using our machine model to select the set of transformations leading to the best overall performance. We have implemented this algorithm as part of the MIPSPro commercial compiler system. We give experimental results showing that our approach is both effective and efficient in optimizing numerical programs.