Achieving Scalable Locality with Time Skewing

Authors:
David Wonnacott
Affiliations:
Department of Computer Science, Haverford College, Haverford, Pennsylvania. davew@cs.haverford.edu
Venue:
International Journal of Parallel Programming
Year:
2002

Citing 26
Cited 17

Estimating interlock and improving balance for pipelined architectures

Journal of Parallel and Distributed Computing
Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Supernode partitioning

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Eliminating false data dependences using the Omega test

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Algorithms in C++

Algorithms in C++
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Counting solutions to Presburger formulas: how and why

PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
The Omega Library interface guide

The Omega Library interface guide
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Data distribution support on distributed shared memory multiprocessors

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
On the Automatic Parallelization of the Perfect Benchmarks®

IEEE Transactions on Parallel and Distributed Systems
Constraint-based array dependence analysis

ACM Transactions on Programming Languages and Systems (TOPLAS)
New tiling techniques to improve cache temporal locality

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Transforming loops to recursion for multi-level memory hierarchies

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Data locality enhancement by memory reduction

ICS '01 Proceedings of the 15th international conference on Supercomputing
Compiling stencils in high performance Fortran

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Optimizing Supercompilers for Supercomputers

Optimizing Supercompilers for Supercomputers
Baring It All to Software: Raw Machines

Computer
Experience in the Automatic Parallelization of Four Perfect-Benchmark Programs

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
An Exact Method for Analysis of Value-based Array Data Dependences

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Code generation for multiple mappings

FRONTIERS '95 Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95)
Using Time Skewing to Eliminate Idle Time due to Memory Bandwidth and Network Limitations

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Fine-grained analysis of array computations

Fine-grained analysis of array computations

Experiences tuning SMG98: a semicoarsening multigrid benchmark based on the hypre library

ICS '02 Proceedings of the 16th international conference on Supercomputing
Improving effective bandwidth through compiler enhancement of global cache reuse

Journal of Parallel and Distributed Computing
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
The Potential of Computation Regrouping for Improving Locality

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Sparse Tiling for Stationary Iterative Methods

International Journal of High Performance Computing Applications
Intermediately executed code is the key to find refactorings that improve temporal data locality

Proceedings of the 3rd conference on Computing frontiers
Sketching stencils

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Smashing: Folding Space to Tile through Time

Languages and Compilers for Parallel Computing
Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs

Proceedings of the 23rd international conference on Supercomputing
Program locality analysis using reuse distance

ACM Transactions on Programming Languages and Systems (TOPLAS)
Two examples of parallel programming without concurrency constructs (PP-CC)

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Data layout transformation for stencil computations on short-vector SIMD architectures

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Safe parallel programming using dynamic dependence hints

Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Out-of-Core Computations of High-Resolution Level Sets by Means of Code Transformation

Journal of Scientific Computing
Combining performance aspects of irregular gauss-seidel via sparse tiling

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Distributed Shared Memory and Compiler-Induced Scalable Locality for Scalable Cluster Performance

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Tile size selection revisited

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Microprocessor speed has been growing exponentially faster than memory system speed in the recent past. This paper explores the long term implications of this trend. We define scalable locality, which measures our ability to apply ever faster processors to increasingly large problems (just as scalable parallelism measures our ability to apply more numerous processors to larger problems). We provide an algorithm called time skewing that derives an execution order and storage mapping to produce any desired degree of locality, for certain programs that can be made to exhibit scalable locality. Our approach is unusual in that it derives the transformation from the algorithm's dataflow (a fundamental characteristic of the algorithm) instead of searching a space of transformations of the execution order and array layout used by the programmer (artifacts of the expression of the algorithm). We provide empirical results for data sets using L2 cache, main memory, and virtual memory.