Increasing temporal locality with skewing and recursive blocking

Authors:
Guohua Jin;John Mellor-Crummey;Robert Fowler
Affiliations:
Rice University, Houston, TX;Rice University, Houston, TX;Rice University, Houston, TX
Venue:
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Year:
2001

Citing 19
Cited 15

Loop skewing: the wavefront method revisited

International Journal of Parallel Programming
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Interprocedural transformations for parallel code generation

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Using integer sets for data-parallel program analysis and optimization

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
New tiling techniques to improve cache temporal locality

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests

Proceedings of the 14th international conference on Supercomputing
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Semicoarsening Multigrid on Distributed Memory Machines

SIAM Journal on Scientific Computing
Tiling imperfectly-nested loop nests

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
High performance Fortran compilation techniques for parallelizing scientific codes

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
Quantifying the Multi-Level Nature of Tiling Interactions

International Journal of Parallel Programming

Experiences tuning SMG98: a semicoarsening multigrid benchmark based on the hypre library

ICS '02 Proceedings of the 16th international conference on Supercomputing
Better tiling and array contraction for compiling scientific programs

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
SFCGen: A framework for efficient generation of multi-dimensional space-filling curves by recursion

ACM Transactions on Mathematical Software (TOMS)
Sparse Tiling for Stationary Iterative Methods

International Journal of High Performance Computing Applications
Integrated Loop Optimizations for Data Locality Enhancement of Tensor Contraction Expressions

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
The potential of the cell processor for scientific computing

Proceedings of the 3rd conference on Computing frontiers
Optimizing locality and scalability of embedded Runge--Kutta solvers using block-based pipelining

Journal of Parallel and Distributed Computing
Compact Hilbert indices: Space-filling curves for domains with unequal side lengths

Information Processing Letters
Scientific computing Kernels on the cell processor

International Journal of Parallel Programming
Smashing: Folding Space to Tile through Time

Languages and Compilers for Parallel Computing
Generating Empirically Optimized Composed Matrix Kernels from MATLAB Prototypes

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
A graph theoretic approach to cache-conscious placement of data for direct mapped caches

Proceedings of the 2010 international symposium on Memory management
Efficient execution of time-step computations with pipelined parallelism and inter-thread data locality optimizaitions

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Combining performance aspects of irregular gauss-seidel via sparse tiling

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a strategy, called recursive prismatic time skewing, that increase temporal reuse at all memory hierarchy levels, thus improving the performance of scientific codes that use iterative methods. Prismatic time skewing partitions iteration space of multiple loops into skewed prisms with both spatial and temporal (or convergence) dimensions. Novel aspects of this work include: multi-dimensional loop skewing; handling carried data dependences in the skewed loops without additional storage; bi-directional skewing to accommodate periodic boundary conditions; and an analysis and transformation strategy that works inter-procedurally. We combine prismatic skewing with a recursive blocking strategy to boost reuse at all levels in a memory hierarchy. A preliminary evaluation of these techniques shows significant performance improvements compared both to original codes and to methods described previously in the literature. With an inter-procedural application of our techniques, we were able to reduce total primary cache misses of a large application code by 27% and secondary cache misses by 119%.