Transforming Complex Loop Nests for Locality

Authors:
Qing Yi;Ken Kennedy;Vikram Adve
Affiliations:
Rice University, 6100 Main Street MS-132, Houston, TX 77005;Rice University, 6100 Main Street MS-132, Houston, TX 77005;University of Illinois at Urbana-Champaign, 1304 W. Springfield Ave, Urbana, IL 61801
Venue:
The Journal of Supercomputing
Year:
2004

Citing 29
Cited 11

Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Uniform techniques for loop optimization

ICS '91 Proceedings of the 5th international conference on Supercomputing
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Optimizing for parallelism and data locality

ICS '92 Proceedings of the 6th international conference on Supercomputing
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
The Omega Library interface guide

The Omega Library interface guide
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Transitive closure of infinite graphs and its applications

International Journal of Parallel Programming - Special issue: selected papers from the eighth international workshop on languages and compilers for parallel computing
Fusion of Loops for Parallelism and Locality

IEEE Transactions on Parallel and Distributed Systems
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Data transformations for eliminating conflict misses

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
An affine partitioning algorithm to maximize parallelism and minimize communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
LAPACK Users' guide (third ed.)

LAPACK Users' guide (third ed.)
Fast greedy weighted fusion

Proceedings of the 14th international conference on Supercomputing
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests

Proceedings of the 14th international conference on Supercomputing
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Automatic loop interchange

SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
Optimizing Supercompilers for Supercomputers

Optimizing Supercompilers for Supercomputers
Dependence Analysis for Supercomputing

Dependence Analysis for Supercomputing
Hierarchical tiling for improved superscalar performance

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Quantifying the Multi-level Nature of Tiling Interactions

LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse

Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse
Fine-grained analysis of array computations

Fine-grained analysis of array computations

Automatic blocking of QR and LU factorizations for locality

MSP '04 Proceedings of the 2004 workshop on Memory system performance
A practical automatic polyhedral parallelizer and locality optimizer

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Automated transformation for performance-critical kernels

LCSD '07 Proceedings of the 2007 Symposium on Library-Centric Software Design
Parametric multi-level tiling of imperfectly nested loops

Proceedings of the 23rd international conference on Supercomputing
Parameterized tiling revisited

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Loop Distribution and Fusion with Timing and Code Size Optimization

Journal of Signal Processing Systems
Practical loop transformations for tensor contraction expressions on multi-level memory hierarchies

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Applying data copy to improve memory performance of general array computations

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Applying loop optimizations to object-oriented abstractions through general classification of array semantics

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Automated programmable control and parameterization of compiler optimizations

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
POET: a scripting language for applying parameterized source-to-source program transformations

Software—Practice & Experience

Quantified Score

Hi-index	0.00

Visualization

Abstract

Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. However, on more complicated structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, and an efficient framework that applies the new technique to aggressively optimize benchmarks for better locality. Our technique is as inexpensive as the traditional unimodular loop transformation techniques and thus can be incorporated into commercial compilers. In addition, it is highly effective and is able to block several linear algebra kernels containing highly challenging loop structures, in particular, Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The automatic blocking of QR and pivoting LU is a notable achievement—to our knowledge, few previous compiler techniques, including theoretically more general loop transformation frameworks [1, 21, 23, 27, 31], were able to completely automate the blocking of these kernels, and none has produced the same blocking as produced by our technique. These results indicate that with low compilation cost, our technique can in practice match the effectiveness of much more expensive frameworks that are theoretically more powerful.