A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

Authors:
Mahmut Kandemir;Alok Choudhary;Nagaraj Shenoy;Prithviraj Banerjee;J. Ramanujam
Affiliations:
Syracuse Univ., Syracuse, NY;Northwestern Univ., Evanston, IL;Northwestern Univ., Evanston, IL;Northwestern Univ., Evanston, IL;Louisiana State Univ., Baton Rouge
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1999

Citing 37
Cited 28

Theory of linear and integer programming

Theory of linear and integer programming
Memory storage patterns in parallel processing

Memory storage patterns in parallel processing
Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Evaluating Associativity in CPU Caches

IEEE Transactions on Computers
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Delinearization: an efficient way to break multiloop dependence equations

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Non-unimodular transformations of nested loops

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Global optimizations for parallelism and locality on scalable parallel machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Communication-free hyperplane partitioning of nested loops

Journal of Parallel and Distributed Computing
SUIF: an infrastructure for research on parallelizing and optimizing compilers

ACM SIGPLAN Notices
Compiling for numa parallel machines

Compiling for numa parallel machines
Optimal evaluation of array expressions on massively parallel machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Reducing false sharing on shared memory multiprocessors through compile time data transformations

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic data layout for high performance Fortran

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
A novel approach towards automatic data distribution

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Evaluating the impact of advanced memory systems on compiler-parallelized codes

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Operating system support for improving data locality on CC-NUMA compute servers

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Data distribution support on distributed shared memory multiprocessors

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A compiler algorithm for optimizing locality in loop nests

ICS '97 Proceedings of the 11th international conference on Supercomputing
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
Compilation techniques for out-of-core parallel computations

Parallel Computing - Special issues on languages and compilers for parallel computers
Dynamic data distribution with control flow analysis

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Dependence Analysis for Supercomputing

Dependence Analysis for Supercomputing
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
False Sharing and Spatial Locality in Multiprocessor Caches

IEEE Transactions on Computers
Compiling Communication-Efficient Programs for Massively Parallel Machines

IEEE Transactions on Parallel and Distributed Systems
Compile-Time Techniques for Data Distribution in Distributed Memory Machines

IEEE Transactions on Parallel and Distributed Systems
Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors

ICPP '97 Proceedings of the international Conference on Parallel Processing
Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Automatic Selection of Dynamic Data Partitioning Schemes for Distributed-Memory Multicomputers

LCPC '95 Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing
Compiler Algorithms For Optimizing Locality And Parallelism On Shared And Distributed Memory Machines

PACT '97 Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques
Compile time techniques for parallel execution of loops on distributed memory multiprocessors

Compile time techniques for parallel execution of loops on distributed memory multiprocessors

Exact analysis of the cache behavior of nested loops

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Scheduling and partitioning for multiple loop nests

Proceedings of the 14th international symposium on Systems synthesis
Data Relation Vectors: A New Abstraction for Data Optimizations

IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Skewed Data Partition and Alignment Techniques for Compiling Programs on Distributed Memory Multicomputers

The Journal of Supercomputing
Communication-Free Alignment for Array References with Linear Subscripts in Three Loop Index Variables or Quadratic Subscripts

The Journal of Supercomputing
Automatic Partitioning of Parallel Loops with Parallelepiped-Shaped Tiles

IEEE Transactions on Parallel and Distributed Systems
A Layout-Conscious Iteration Space Transformation Technique

IEEE Transactions on Computers
ParAgent: A Domain-Specific Semi-automatic Parallelization Tool

HiPC '00 Proceedings of the 7th International Conference on High Performance Computing
Skewed Data Partition and Alignment Techniques for Compiling Programs on Distributed Memory Multicomputers

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
Data cache locking for higher program predictability

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Instruction Scheduling for Low Power

Journal of VLSI Signal Processing Systems
A fast and accurate framework to analyze and optimize cache memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior

IEEE Transactions on Computers
Support and optimization for parallel sparse programs with array intrinsics of Fortran 90

Parallel Computing
New Complexity Results on Array Contraction and Related Problems

Journal of VLSI Signal Processing Systems
Improving superword level parallelism support in modern compilers

CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
An accurate cost model for guiding data locality transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Improving the energy behavior of block buffering using compiler optimizations

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Fast indexing for blocked array layouts to reduce cache misses

International Journal of High Performance Computing and Networking
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

The Journal of Supercomputing
Partial access conflict-relieving programmable address shuffler for parallel memory system in multi-core processor

Microprocessors & Microsystems
A grid-based programming approach for distributed linear algebra applications

Multiagent and Grid Systems
Data layout transformation for stencil computations on short-vector SIMD architectures

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions

Journal of Parallel and Distributed Computing
Optimizing data locality using array tiling

Proceedings of the International Conference on Computer-Aided Design
Empirical performance-model driven data layout optimization

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
A data layout optimization framework for NUCA-based multicores

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.01

Visualization

Abstract

This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.