Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver

Authors:
Sandhya Krishnan;Sriram Krishnamoorthy;Gerald Baumgartner;Chi-Chung Lam;J. Ramanujam;P. Sadayappan;Venkatesh Choppella
Affiliations:
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA;Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, USA;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA;Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803, USA;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA and Indian Institute of Information Technology and Management---Kerala, Technopark, Thiruvanantha ...
Venue:
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Year:
2006

Citing 35
Cited 2

A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
A model and compilation strategy for out-of-core data parallel programs

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Compiler cache optimizations for banded matrix problems

ICS '95 Proceedings of the 9th international conference on Supercomputing
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
Automatic optimization of communication in compiling out-of-core stencil codes

ICS '96 Proceedings of the 10th international conference on Supercomputing
Automatic compiler-inserted I/O prefetching for out-of-core applications

OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Automatic parallel I/O performance optimization in Panda

Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Eliminating conflict misses for high performance architectures

ICS '98 Proceedings of the 12th international conference on Supercomputing
Compilation techniques for out-of-core parallel computations

Parallel Computing - Special issues on languages and compilers for parallel computers
Combining loop transformations considering caches and scheduling

International Journal of Parallel Programming - Special issue: MICRO-29, 29th annual IEEE/ACM international symposium on microarchitecture
New tiling techniques to improve cache temporal locality

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
An experimental evaluation of tiling and shackling for memory hierarchy management

ICS '99 Proceedings of the 13th international conference on Supercomputing
Quantifying the multi-level nature of tiling interactions

International Journal of Parallel Programming
Cache miss equations: a compiler framework for analyzing and tuning memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests

Proceedings of the 14th international conference on Supercomputing
A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations

IEEE Transactions on Parallel and Distributed Systems
Loop optimization for a class of memory-constrained computations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Blocking and array contraction across arbitrarily nested loops using affine partitioning

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Space-time trade-off optimization for a class of electronic structure calculations

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests

International Journal of Parallel Programming
An I/O-Conscious Tiling Strategy for Disk-Resident Data Sets

The Journal of Supercomputing
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization

HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Memory-Optimal Evaluation of Expression Trees Involving Large Objects

HiPC '99 Proceedings of the 6th International Conference on High Performance Computing
Optimization of Memory Usage Requirement for a Class of Loops Implementing Multi-dimensional Integrals

LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Optimal Anytime Constrained Simulated Annealing for Constrained Global Optimization

CP '02 Proceedings of the 6th International Conference on Principles and Practice of Constraint Programming
A high-level approach to synthesis of high-performance codes for quantum chemistry

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Global Communication Optimization for Tensor Contraction Expressions under Memory Constraints

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Constrained genetic algorithms and their applications in nonlinear constrained optimization

ICTAI '00 Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence
Techniques for compiling i/o intensive parallel programs

Techniques for compiling i/o intensive parallel programs
Performance optimization of a class of loops implementing multidimensional integrals

Performance optimization of a class of loops implementing multidimensional integrals

Efficient search-space pruning for integrated fusion and tiling transformations

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

We address the problem of efficient out-of-core code generation for a special class of imperfectly nested loops encoding tensor contractions arising in quantum chemistry computations. These loops operate on arrays too large to fit in physical memory. The problem involves determining optimal tiling of loops and placement of disk I/O statements. This entails a search in an explosively large parameter space. We formulate the problem as a nonlinear optimization problem and use a discrete constraint solver to generate optimized out-of-core code. The solution generated using the discrete constraint solver consistently outperforms other approaches by up to a factor of four. Measurements on sequential and parallel versions of the generated code demonstrate the effectiveness of the approach.