Theory of linear and integer programming
Theory of linear and integer programming
Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Global optimizations for parallelism and locality on scalable parallel machines
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Maximizing parallelism and minimizing synchronization with affine transforms
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Maximizing parallelism and minimizing synchronization with affine partitions
Parallel Computing - Special issues on languages and compilers for parallel computers
Organizing matrices and matrix operations for paged memory systems
Communications of the ACM
Blocking and array contraction across arbitrarily nested loops using affine partitioning
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Dependence Analysis for Supercomputing
Dependence Analysis for Supercomputing
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Communication-Free Parallelization via Affine Transformations
LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Improving the performance of virtual memory computers.
Improving the performance of virtual memory computers.
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
Practical Structure Layout Optimization and Advice
Proceedings of the International Symposium on Code Generation and Optimization
Whole-program optimization of global variable layout
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Prefetch throttling and data pinning for improving performance of shared caches
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A compiler framework for general memory layout optimizations targeting structures
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
McFLAT: a profile-based framework for MATLAB loop analysis and transformations
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Locality optimization of stencil applications using data dependency graphs
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
PaCT'11 Proceedings of the 11th international conference on Parallel computing technologies
The Journal of Supercomputing
Hi-index | 0.00 |
This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifies the various transforms as unimodular matrix transformations.The algorithm has been implemented in the SUIF (Stanford University Intermediate Format) compiler, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization. Performance evaluation indicates that locality optimization is especially crucial for scaling up the performance of parallel code.