Automatic tiling of iterative stencil loops

Authors:
Zhiyuan Li; Yonghong Song
Affiliations:
Purdue University, West Lafayette, IN; Purdue University, West Lafayette, IN
Venue:
ACM Transactions on Programming Languages and Systems (TOPLAS)
Year:
2004

Abstract

Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation and signal processing. Such loops repeatedly modify the same array elements over successive time steps, which gives the compiler an opportunity to improve temporal data locality through loop tiling. This article presents a compiler framework for automatic tiling of iterative stencil loops, with the objective of improving cache performance. The article first presents a technique that allows loop tiling to satisfy data dependences despite the difficulty created by imperfectly nested inner loops. It does so by skewing the inner loops over the time steps and by applying a uniform skew factor to all loops at the same nesting level. Based on a memory-cost analysis, the article shows that the skew factor must be minimized at every loop level in order to minimize cache misses. A polynomial-time graph-theoretical algorithm is presented to determine the minimum skew factor. Furthermore, the memory-cost analysis derives the tile size that minimizes capacity misses. Given the tile size, an efficient and general array-padding scheme is applied to remove conflict misses. Experiments were conducted on 16 test programs, and preliminary results showed an average speedup of 1.58 and a maximum speedup of 5.06 across those programs.
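
The idea of skewing an inner loop over the time steps so that tiling becomes legal can be illustrated with a minimal sketch. It is not the article's framework: it assumes a simplified, perfectly nested, in-place 3-point relaxation rather than the imperfectly nested loops the article handles, and the function names, the `tile` parameter, and the `skew` parameter are hypothetical. In this kernel the dependence distances in (t, i) are (0, 1), (1, 0), and (1, -1). Replacing i with j = i + skew * t turns them into (0, 1), (1, skew), and (1, skew - 1), which are all non-negative once skew is at least 1, so tiles of the skewed index can be executed one after another, each across all time steps.

```python
def stencil_reference(a, steps):
    """Untiled in-place 3-point relaxation: one full sweep per time step."""
    a = list(a)
    n = len(a)
    for t in range(steps):
        for i in range(1, n - 1):
            a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


def stencil_skewed_tiled(a, steps, tile, skew=1):
    """Same computation, tiled along the skewed index j = i + skew * t.

    Assumption of this sketch: skew >= 1, which is enough to make every
    dependence distance of this particular kernel non-negative.
    """
    a = list(a)
    n = len(a)
    # j covers [1, n - 2 + skew * (steps - 1)]; j_end is the exclusive bound.
    j_end = n - 1 + skew * (steps - 1)
    for jj in range(1, j_end, tile):        # loop over tiles of the skewed index
        for t in range(steps):              # all time steps inside one tile
            lo = max(jj, 1 + skew * t)      # clip tile to the valid j range at step t
            hi = min(jj + tile, n - 1 + skew * t)
            for j in range(lo, hi):
                i = j - skew * t            # map back to the original array index
                a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


if __name__ == "__main__":
    data = [float((7 * k) % 11) for k in range(40)]
    same = stencil_skewed_tiled(data, steps=5, tile=8) == stencil_reference(data, steps=5)
    print("tiled result matches reference:", same)
```

Because the tiled version performs the same updates in an order that respects every dependence, its result is identical to the reference. A larger skew is also legal here, but it shifts each tile further across the array per time step, so each tile touches more distinct elements, which is consistent with the article's conclusion that the skew factor should be minimized.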