A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
A model and compilation strategy for out-of-core data parallel programs
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Compiler cache optimizations for banded matrix problems
ICS '95 Proceedings of the 9th international conference on Supercomputing
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Automatic optimization of communication in compiling out-of-core stencil codes
ICS '96 Proceedings of the 10th international conference on Supercomputing
Automatic compiler-inserted I/O prefetching for out-of-core applications
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Automatic parallel I/O performance optimization in Panda
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures
Eliminating conflict misses for high performance architectures
ICS '98 Proceedings of the 12th international conference on Supercomputing
Compilation techniques for out-of-core parallel computations
Parallel Computing - Special issues on languages and compilers for parallel computers
Combining loop transformations considering caches and scheduling
International Journal of Parallel Programming - Special issue: MICRO-29, 29th annual IEEE/ACM international symposium on microarchitecture
New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
An experimental evaluation of tiling and shackling for memory hierarchy management
ICS '99 Proceedings of the 13th international conference on Supercomputing
Quantifying the multi-level nature of tiling interactions
International Journal of Parallel Programming
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests
Proceedings of the 14th international conference on Supercomputing
IEEE Transactions on Parallel and Distributed Systems
Loop optimization for a class of memory-constrained computations
ICS '01 Proceedings of the 15th international conference on Supercomputing
Blocking and array contraction across arbitrarily nested loops using affine partitioning
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Space-time trade-off optimization for a class of electronic structure calculations
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests
International Journal of Parallel Programming
An I/O-Conscious Tiling Strategy for Disk-Resident Data Sets
The Journal of Supercomputing
HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Memory-Optimal Evaluation of Expression Trees Involving Large Objects
HiPC '99 Proceedings of the 6th International Conference on High Performance Computing
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Optimal Anytime Constrained Simulated Annealing for Constrained Global Optimization
CP '02 Proceedings of the 6th International Conference on Principles and Practice of Constraint Programming
A high-level approach to synthesis of high-performance codes for quantum chemistry
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Global Communication Optimization for Tensor Contraction Expressions under Memory Constraints
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Constrained genetic algorithms and their applications in nonlinear constrained optimization
ICTAI '00 Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence
Techniques for compiling i/o intensive parallel programs
Techniques for compiling i/o intensive parallel programs
Performance optimization of a class of loops implementing multidimensional integrals
Performance optimization of a class of loops implementing multidimensional integrals
Efficient search-space pruning for integrated fusion and tiling transformations
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
We address the problem of efficient out-of-core code generation for a special class of imperfectly nested loops encoding tensor contractions arising in quantum chemistry computations. These loops operate on arrays too large to fit in physical memory. The problem involves determining optimal tiling of loops and placement of disk I/O statements. This entails a search in an explosively large parameter space. We formulate the problem as a nonlinear optimization problem and use a discrete constraint solver to generate optimized out-of-core code. The solution generated using the discrete constraint solver consistently outperforms other approaches by up to a factor of four. Measurements on sequential and parallel versions of the generated code demonstrate the effectiveness of the approach.