This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism and by reducing communication overhead. For out-of-core problems, where data sets far exceed the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing I/O accesses. We present algorithms that consider both iteration-space (loop) and data-space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one by one and attempts to optimize each reference for parallelism and locality. When parallelism optimizations do not apply to a reference, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiled experiments on the IBM SP-2 and Intel Paragon distributed-memory message-passing architectures show that this approach reduces execution times and improves overall speedups. In addition, we extend the base algorithm to handle file layout constraints and show how it can be used to optimize programs consisting of multiple loop nests.
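The central idea of choosing file layouts and loop order together can be illustrated with a small sketch. The C fragment below is a minimal, hypothetical example, not the paper's algorithm: the array size, tile size, file name, and helper functions are all illustrative assumptions. It tiles a loop nest over a disk-resident array whose file layout is row major, so that each tile row is a contiguous run in the file and every tile is read and written exactly once.

```c
/* Minimal sketch of a tiled out-of-core loop nest.
 * Assumes a row-major file layout and a tile that fits in core;
 * all names and sizes are illustrative, not from the paper. */
#include <stdio.h>
#include <stdlib.h>

#define N    4096   /* logical array is N x N, too large for memory */
#define TILE 512    /* a TILE x TILE tile fits in core */

/* Read one TILE x TILE tile with top-left corner (ti, tj). With a
 * row-major layout each tile row is contiguous in the file, so one
 * seek and one bulk read per row suffice. */
static void read_tile(FILE *f, double *buf, long ti, long tj) {
    for (long r = 0; r < TILE; r++) {
        long offset = ((ti + r) * (long)N + tj) * (long)sizeof(double);
        fseek(f, offset, SEEK_SET);
        fread(buf + r * TILE, sizeof(double), TILE, f);
    }
}

static void write_tile(FILE *f, const double *buf, long ti, long tj) {
    for (long r = 0; r < TILE; r++) {
        long offset = ((ti + r) * (long)N + tj) * (long)sizeof(double);
        fseek(f, offset, SEEK_SET);
        fwrite(buf + r * TILE, sizeof(double), TILE, f);
    }
}

int main(void) {
    FILE *fa = fopen("A.dat", "r+b");   /* hypothetical data file */
    if (!fa) { perror("A.dat"); return 1; }
    double *a = malloc((size_t)TILE * TILE * sizeof(double));

    /* Tiled traversal: the loop order matches the file layout, so
     * each tile is read and written exactly once. */
    for (long ti = 0; ti < N; ti += TILE)
        for (long tj = 0; tj < N; tj += TILE) {
            read_tile(fa, a, ti, tj);
            for (long i = 0; i < TILE; i++)   /* in-core compute */
                for (long j = 0; j < TILE; j++)
                    a[i * TILE + j] *= 2.0;
            write_tile(fa, a, ti, tj);
        }

    free(a);
    fclose(fa);
    return 0;
}
```

With a column-major file layout, the same loop order would force one seek per element instead of one per tile row; choosing the layout and the loop structure jointly, as the framework described in the abstract does, is what avoids that mismatch.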
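The communication-vectorization step can be sketched in the same spirit: when a reference cannot be optimized for parallelism, the data transfer is hoisted out of the innermost loop so that one large message replaces many small ones. The MPI version below is a hedged illustration only; the paper targets message-passing machines in general, and the data distribution, neighbor ranks, and buffer names here are assumptions introduced for the example.

```c
/* Sketch of communication vectorization in an MPI setting.
 * The distribution and neighbor computation are illustrative. */
#include <mpi.h>

#define M 1024   /* boundary strip length; illustrative */

/* Naive version, for contrast: one message per element inside the loop.
 *   for (int j = 0; j < M; j++) {
 *       MPI_Recv(&ghost[j], 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
 *                MPI_STATUS_IGNORE);
 *       out[j] = local[j] + ghost[j];
 *   }
 */

void vectorized_step(double *local, double *ghost, double *out,
                     int left, int right, int rank) {
    /* Vectorized version: the whole boundary strip is transferred once,
     * before the innermost loop, so the loop body is pure computation.
     * Even/odd ordering of send and receive avoids deadlock. */
    if (rank % 2 == 0) {
        MPI_Send(local, M, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Recv(ghost, M, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(ghost, M, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(local, M, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    }
    for (int j = 0; j < M; j++)   /* communication-free inner loop */
        out[j] = local[j] + ghost[j];
}
```

The payoff is the same one the abstract claims: a single bulk transfer before the innermost loop amortizes per-message startup cost, which on machines like the SP-2 and Paragon dominates the cost of fine-grained communication.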