A Layout-Conscious Iteration Space Transformation Technique

Authors:
M. Kandemir;J. Ramanujam;A. Choudhary;P. Banerjee
Venue:
IEEE Transactions on Computers
Year:
2001

Abstract

Exploiting locality of reference has become extremely important in realizing the potential performance of modern machines with deep memory hierarchies. The data access patterns of programs and the memory layouts of the accessed data sets play a critical role in determining the performance of applications running on these machines. This paper presents a cache locality optimization technique that can optimize a loop nest even if the arrays referenced have different layouts in memory. Such a capability is required for a global locality optimization framework that applies both loop and data transformations to a sequence of loop nests for optimizing locality. Our method uses a single linear algebra framework to represent both data layouts and loop transformations. It computes a nonsingular loop transformation matrix such that, in a given loop nest, data locality is exploited in the innermost loops, where it is most useful. The inverse of this nonsingular transformation matrix is built column-by-column, starting from the rightmost column. In addition, our approach can work in those cases where the data layouts of a subset of the referenced arrays are unknown; this is a key step in optimizing a sequence of loop nests and whole programs for locality. Experimental results on an SGI/Cray Origin 2000 nonuniform memory access multiprocessor machine show that our technique reduces execution times by as much as 70 percent.
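
The idea of building the inverse transformation from its rightmost column can be illustrated with a small sketch. This is not the paper's algorithm, only a toy version under stated assumptions: each array layout is encoded as a hyperplane-style vector g (here, (1, 0) for row-major and (0, 1) for column-major in two dimensions), each reference has an access matrix L, and the rightmost column q of the inverse matrix is taken to give spatial locality when g . (L q) = 0 for every reference. The function names (`last_column`, `complete_inverse`), the brute-force search over entries in {-1, 0, 1}, and the completion with unit vectors are all hypothetical choices made for illustration; data dependences and legality checks are ignored.

```python
from fractions import Fraction
from itertools import product


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank(vectors):
    """Rank of a list of vectors, by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    ncols = len(rows[0]) if rows else 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def last_column(refs, depth):
    """Search for a nonzero q with g . (L q) == 0 for every (g, L) in refs.

    Assumption: small entries in {-1, 0, 1} are enough for the toy cases here.
    Returns None when no such vector exists in that search space.
    """
    for q in product((0, 1, -1), repeat=depth):
        if any(q) and all(dot(g, mat_vec(L, q)) == 0 for g, L in refs):
            return list(q)
    return None


def complete_inverse(q):
    """Fill the remaining columns right to left with unit vectors,
    keeping all columns linearly independent so the result is nonsingular."""
    n = len(q)
    cols = [q]  # rightmost column first
    for k in reversed(range(n)):
        e = [1 if i == k else 0 for i in range(n)]
        if len(cols) < n and rank(cols + [e]) == len(cols) + 1:
            cols.append(e)
    cols.reverse()  # now ordered left to right
    return [[cols[c][r] for c in range(n)] for r in range(n)]  # rows of the matrix


# Loop nest over (i, j) touching A[i][j] and B[j][i] with different layouts.
A_ref = ([0, 1], [[1, 0], [0, 1]])  # A[i][j], column-major (assumed encoding)
B_ref = ([1, 0], [[0, 1], [1, 0]])  # B[j][i], row-major (assumed encoding)

q = last_column([A_ref, B_ref], depth=2)
Q = complete_inverse(q)
```

In this example the sketch selects q = (1, 0) and completes it to the loop-interchange matrix, so a single loop transformation serves two arrays stored in different layouts. When both arrays are row-major instead, the two constraints conflict and `last_column` returns `None`, which is the kind of case where a layout change for one of the arrays would be needed.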