Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion

Authors:
Qing Yi; Ken Kennedy
Affiliations:
Computer Science Department, Rice University, MS-132, Houston, USA (both authors)
Venue:
International Journal of High Performance Computing Applications
Year:
2004

Abstract

Because of the increasing gap between the speeds of processors and main memories, compilers must enhance the locality of applications to achieve high performance. Loop fusion enhances locality by fusing loops that access similar sets of data. Typically, it is applied to loops at the same level after loop interchange, which first attains the best nesting order for each local loop nest. However, since loop interchange cannot foresee the overall optimization effect, it often selects the wrong loops to be placed outermost for fusion, achieving suboptimal performance globally. Building on traditional unimodular transformations on perfectly nested loops, we present a novel transformation, dependence hoisting, that effectively combines interchange and fusion for arbitrarily nested loops. We present techniques to simultaneously interchange and fuse loops at multiple levels. By evaluating the compound optimization effect beforehand, we have achieved better performance than was possible by previous techniques, which apply interchange and fusion separately.
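
To make the interaction between interchange and fusion concrete, here is a minimal sketch in Python. It is not the dependence hoisting transformation described in the abstract; it only illustrates, on a hypothetical example, why the nesting order chosen for one loop nest affects whether its outer loop can be fused with a neighbor. The function names, the arrays `row_sum` and `col_scaled`, and the computation itself are assumptions made for illustration.

```python
def separate_nests(a, n):
    """Two loop nests over the same n x n array, with different nesting orders."""
    row_sum = [0.0] * n
    col_scaled = [[0.0] * n for _ in range(n)]

    # Nest 1: i is the outermost loop.
    for i in range(n):
        for j in range(n):
            row_sum[i] += a[i][j]

    # Nest 2: j is the outermost loop, so its outer loop does not line up
    # with the outer loop of nest 1 and the two cannot be fused as written.
    for j in range(n):
        for i in range(n):
            col_scaled[i][j] = 2.0 * a[i][j]

    return row_sum, col_scaled


def interchanged_and_fused(a, n):
    """Nest 2 interchanged to (i, j) order, then fused with nest 1 at both levels.

    The interchange is legal in this example because nest 2 carries no
    dependences between iterations; each a[i][j] is now read once and
    used by both statements while it is still in fast memory.
    """
    row_sum = [0.0] * n
    col_scaled = [[0.0] * n for _ in range(n)]

    for i in range(n):
        for j in range(n):
            row_sum[i] += a[i][j]
            col_scaled[i][j] = 2.0 * a[i][j]

    return row_sum, col_scaled
```

In this sketch the right nesting order for the second nest is only apparent once the first nest is taken into account, which is the kind of global view the abstract argues a separate interchange pass lacks.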