Blocking and array contraction across arbitrarily nested loops using affine partitioning

Authors:
Amy W. Lim;Shih-Wei Liao;Monica S. Lam
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, CA;Intel Research in Santa Clara, California and Computer Systems Laboratory, Stanford University, Stanford, CA;Computer Systems Laboratory, Stanford University, Stanford, CA
Venue:
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Year:
2001


Abstract

Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and statement reordering. Algorithms based on affine partitioning have been shown to be effective for parallelization and communication minimization. This paper presents algorithms that improve data locality using affine partitioning.

Blocking and array contraction are two important optimizations that have been shown to be useful for data locality. Blocking creates a set of inner loops so that data brought into the faster levels of the memory hierarchy can be reused. Array contraction reduces an array to a scalar variable and thereby reduces the number of memory operations executed and the memory footprint. Loop transforms are often necessary to make blocking and array contraction possible. By bringing the full generality of affine partitioning to bear on the problem, our locality algorithm can find more contractable arrays than previously possible. This paper also generalizes the concept of blocking and shows that affine partitioning allows the benefits of blocking to be realized in arbitrarily nested loops. Experimental results on a number of benchmarks and a complete multigrid application in aeronautics indicate that affine partitioning is effective in practice.
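
As a rough illustration of the two optimizations named in the abstract, the sketch below shows hand-written before-and-after versions of each. It is not the paper's affine partitioning algorithm and does not derive the transformations automatically. All function names, the block size `bs`, and the example computations (a scaled vector and a square matrix product) are assumptions chosen for illustration.

```python
# Illustrative sketch only: hand-applied array contraction and blocking.
# Function names and the example kernels are hypothetical, not from the paper.

def unfused(a):
    """Two separate loops that communicate through a full temporary array t."""
    n = len(a)
    t = [0.0] * n
    for i in range(n):
        t[i] = 2.0 * a[i]
    b = [0.0] * n
    for i in range(n):
        b[i] = t[i] + 1.0
    return b


def fused_contracted(a):
    """After fusing the loops, each t value is produced and consumed in the
    same iteration, so the array t can be contracted to a scalar."""
    n = len(a)
    b = [0.0] * n
    for i in range(n):
        t = 2.0 * a[i]      # scalar replaces the array element t[i]
        b[i] = t + 1.0
    return b


def matmul_naive(x, y):
    """Plain triple loop over square matrices stored as lists of lists."""
    n = len(x)
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                z[i][j] += x[i][k] * y[k][j]
    return z


def matmul_blocked(x, y, bs=2):
    """Same computation, with the iteration space split into bs-sized blocks.
    The three inner loops touch only a small sub-block of each matrix, which
    is the set of data that blocking aims to keep in fast memory."""
    n = len(x)
    z = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        for k in range(kk, min(kk + bs, n)):
                            z[i][j] += x[i][k] * y[k][j]
    return z
```

In both pairs the transformed version computes the same result as the original. In the first pair, contraction is possible only after the two loops are fused, which is one small instance of the abstract's remark that loop transforms are often needed to enable these optimizations. Pure Python will not show a cache benefit from blocking; the sketch is meant to show the loop structure only.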