This paper describes a tiling technique that application programmers and optimizing compilers can use to obtain I/O-efficient versions of regular scientific loop nests. Because of the particular characteristics of I/O operations, a straightforward extension of the traditional tiling method to I/O-intensive programs may result in poor I/O performance. The technique presented here therefore adapts iteration space tiling to loop nests that perform I/O. The generated code substantially reduces both the number of I/O calls and the volume of data transferred between the disk subsystem and main memory. Our experimental results on the IBM SP-2 distributed-memory message-passing multiprocessor demonstrate that reducing these two parameters can lead to a marked decrease in the overall execution times of I/O-intensive loop nests. For a number of loop nests extracted from several benchmarks and math libraries, we improved execution times by an average of 42.5% for one data set and by an average of 47.4% for another.
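To illustrate why tile shape can matter for the number of I/O calls, the following is a minimal sketch under a simplified cost model. It is not the algorithm from the paper. It assumes a two-dimensional array stored row-major in a file, that one I/O call can fetch exactly one contiguous run of elements, and that a fixed number of elements fits in memory at a time. The function names, the array size `N`, and the memory budget `MEM` are hypothetical and chosen only for illustration. The sketch counts calls only; it does not model the transferred data volume, which is the same for both tile shapes here.

```python
def io_calls_for_tile(tile_rows, tile_cols, n_cols):
    """Contiguous read calls needed to fetch one tile from a row-major file."""
    # A tile that spans complete rows is a single contiguous run in the file.
    if tile_cols == n_cols:
        return 1
    # Otherwise every row of the tile is a separate contiguous run.
    return tile_rows


def total_io_calls(n_rows, n_cols, tile_rows, tile_cols):
    """Total read calls to sweep the whole array once, tile by tile."""
    # For simplicity, require the tile shape to divide the array evenly.
    assert n_rows % tile_rows == 0 and n_cols % tile_cols == 0
    n_tiles = (n_rows // tile_rows) * (n_cols // tile_cols)
    return n_tiles * io_calls_for_tile(tile_rows, tile_cols, n_cols)


N = 1024        # assumed array dimension (N x N elements)
MEM = 64 * 64   # assumed number of elements that fit in memory at once

# Traditional square tiles: 64 x 64 elements each.
square = total_io_calls(N, N, 64, 64)

# Slab tiles with the same memory footprint that span full rows: 4 x 1024.
slab = total_io_calls(N, N, MEM // N, N)

print(square, slab)
```

Under these assumptions both tilings read every element exactly once, but the square tiling issues one call per tile row while the slab tiling issues one call per tile, so the slab shape needs far fewer calls for the same memory budget.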