Estimating interlock and improving balance for pipelined architectures
Journal of Parallel and Distributed Computing
Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Eliminating false data dependences using the Omega test
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Algorithms in C++
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Counting solutions to Presburger formulas: how and why
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
The Omega Library interface guide
The Omega Library interface guide
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Data distribution support on distributed shared memory multiprocessors
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
On the Automatic Parallelization of the Perfect Benchmarks®
IEEE Transactions on Parallel and Distributed Systems
Constraint-based array dependence analysis
ACM Transactions on Programming Languages and Systems (TOPLAS)
New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Transforming loops to recursion for multi-level memory hierarchies
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Data locality enhancement by memory reduction
ICS '01 Proceedings of the 15th international conference on Supercomputing
Compiling stencils in high performance Fortran
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Optimizing Supercompilers for Supercomputers
Optimizing Supercompilers for Supercomputers
Experience in the Automatic Parallelization of Four Perfect-Benchmark Programs
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
An Exact Method for Analysis of Value-based Array Data Dependences
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Code generation for multiple mappings
FRONTIERS '95 Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95)
Using Time Skewing to Eliminate Idle Time due to Memory Bandwidth and Network Limitations
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Fine-grained analysis of array computations
Fine-grained analysis of array computations
Experiences tuning SMG98: a semicoarsening multigrid benchmark based on the hypre library
ICS '02 Proceedings of the 16th international conference on Supercomputing
Improving effective bandwidth through compiler enhancement of global cache reuse
Journal of Parallel and Distributed Computing
Automatic tiling of iterative stencil loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
The Potential of Computation Regrouping for Improving Locality
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Sparse Tiling for Stationary Iterative Methods
International Journal of High Performance Computing Applications
Intermediately executed code is the key to find refactorings that improve temporal data locality
Proceedings of the 3rd conference on Computing frontiers
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Smashing: Folding Space to Tile through Time
Languages and Compilers for Parallel Computing
Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs
Proceedings of the 23rd international conference on Supercomputing
Program locality analysis using reuse distance
ACM Transactions on Programming Languages and Systems (TOPLAS)
Two examples of parallel programming without concurrency constructs (PP-CC)
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Data layout transformation for stencil computations on short-vector SIMD architectures
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Safe parallel programming using dynamic dependence hints
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Out-of-Core Computations of High-Resolution Level Sets by Means of Code Transformation
Journal of Scientific Computing
Combining performance aspects of irregular gauss-seidel via sparse tiling
LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Distributed Shared Memory and Compiler-Induced Scalable Locality for Scalable Cluster Performance
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Microprocessor speed has been growing exponentially faster than memory system speed in the recent past. This paper explores the long term implications of this trend. We define scalable locality, which measures our ability to apply ever faster processors to increasingly large problems (just as scalable parallelism measures our ability to apply more numerous processors to larger problems). We provide an algorithm called time skewing that derives an execution order and storage mapping to produce any desired degree of locality, for certain programs that can be made to exhibit scalable locality. Our approach is unusual in that it derives the transformation from the algorithm's dataflow (a fundamental characteristic of the algorithm) instead of searching a space of transformations of the execution order and array layout used by the programmer (artifacts of the expression of the algorithm). We provide empirical results for data sets using L2 cache, main memory, and virtual memory.