This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism and by reducing communication overhead. For out-of-core problems, where data sets far exceed the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing I/O accesses. We present algorithms that consider both iteration-space (loop) and data-space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one by one and attempts to optimize each reference for parallelism and locality. When parallelism optimizations do not apply to a reference, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiled experiments on the IBM SP-2 and Intel Paragon distributed-memory message-passing architectures show that this approach reduces execution times and improves overall speedups. In addition, we extend the base algorithm to handle file layout constraints and show how it can be used to optimize programs consisting of multiple loop nests.
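The central idea of choosing file layouts and loop order together can be illustrated with a small sketch. The C fragment below is a minimal, hypothetical example, not the paper's algorithm: the array size, tile size, file name, and helper functions are all illustrative assumptions. It tiles a loop nest over a disk-resident array whose file layout is row major, so that each tile row is a contiguous run in the file and every tile is read and written exactly once.

```c
/* Minimal sketch of a tiled out-of-core loop nest.
 * Assumes a row-major file layout and a tile that fits in core;
 * all names and sizes are illustrative, not from the paper. */
#include <stdio.h>
#include <stdlib.h>

#define N    4096   /* logical array is N x N, too large for memory */
#define TILE 512    /* a TILE x TILE tile fits in core */

/* Read one TILE x TILE tile with top-left corner (ti, tj). With a
 * row-major layout each tile row is contiguous in the file, so one
 * seek and one bulk read per row suffice. */
static void read_tile(FILE *f, double *buf, long ti, long tj) {
    for (long r = 0; r < TILE; r++) {
        long offset = ((ti + r) * (long)N + tj) * (long)sizeof(double);
        fseek(f, offset, SEEK_SET);
        fread(buf + r * TILE, sizeof(double), TILE, f);
    }
}

static void write_tile(FILE *f, const double *buf, long ti, long tj) {
    for (long r = 0; r < TILE; r++) {
        long offset = ((ti + r) * (long)N + tj) * (long)sizeof(double);
        fseek(f, offset, SEEK_SET);
        fwrite(buf + r * TILE, sizeof(double), TILE, f);
    }
}

int main(void) {
    FILE *fa = fopen("A.dat", "r+b");   /* hypothetical data file */
    if (!fa) { perror("A.dat"); return 1; }
    double *a = malloc((size_t)TILE * TILE * sizeof(double));

    /* Tiled traversal: the loop order matches the file layout, so
     * each tile is read and written exactly once. */
    for (long ti = 0; ti < N; ti += TILE)
        for (long tj = 0; tj < N; tj += TILE) {
            read_tile(fa, a, ti, tj);
            for (long i = 0; i < TILE; i++)   /* in-core compute */
                for (long j = 0; j < TILE; j++)
                    a[i * TILE + j] *= 2.0;
            write_tile(fa, a, ti, tj);
        }

    free(a);
    fclose(fa);
    return 0;
}
```

With a column-major file layout, the same loop order would force one seek per element instead of one per tile row; choosing the layout and the loop structure jointly, as the framework described in the abstract does, is what avoids that mismatch.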
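The communication-vectorization step can be sketched in the same spirit: when a reference cannot be optimized for parallelism, the data transfer is hoisted out of the innermost loop so that one large message replaces many small ones. The MPI version below is a hedged illustration only; the paper targets message-passing machines in general, and the data distribution, neighbor ranks, and buffer names here are assumptions introduced for the example.

```c
/* Sketch of communication vectorization in an MPI setting.
 * The distribution and neighbor computation are illustrative. */
#include <mpi.h>

#define M 1024   /* boundary strip length; illustrative */

/* Naive version, for contrast: one message per element inside the loop.
 *   for (int j = 0; j < M; j++) {
 *       MPI_Recv(&ghost[j], 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
 *                MPI_STATUS_IGNORE);
 *       out[j] = local[j] + ghost[j];
 *   }
 */

void vectorized_step(double *local, double *ghost, double *out,
                     int left, int right, int rank) {
    /* Vectorized version: the whole boundary strip is transferred once,
     * before the innermost loop, so the loop body is pure computation.
     * Even/odd ordering of send and receive avoids deadlock. */
    if (rank % 2 == 0) {
        MPI_Send(local, M, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Recv(ghost, M, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(ghost, M, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(local, M, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    }
    for (int j = 0; j < M; j++)   /* communication-free inner loop */
        out[j] = local[j] + ghost[j];
}
```

The payoff is the same one the abstract claims: a single bulk transfer before the innermost loop amortizes per-message startup cost, which on machines like the SP-2 and Paragon dominates the cost of fine-grained communication.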