Efficient search-space pruning for integrated fusion and tiling transformations

Authors:
Xiaoyang Gao;Sriram Krishnamoorthy;Swarup Kumar Sahoo;Chi-Chung Lam;Gerald Baumgartner;J. Ramanujam;P. Sadayappan
Affiliations:
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Computer Science, Louisiana State University, Baton Rouge, LA;Department of Electrical and Computer Engineering and Center for Computation and Technology, Louisiana State University, Baton Rouge, LA;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH
Venue:
LCPC '05: Proceedings of the 18th International Conference on Languages and Compilers for Parallel Computing
Year:
2005

Abstract

Compile-time optimization involves a number of transformations, such as loop permutation, fusion, tiling, and array contraction. Choosing the combination of these transformations that minimizes execution time is a challenging task. We address this problem in the context of tensor contraction expressions involving arrays too large to fit in main memory. Domain-specific features of the computation are exploited to develop an integrated framework that facilitates exploration of the entire search space of optimizations. In this paper, we discuss the exploration of the space of loop fusion and tiling transformations with the goal of minimizing disk I/O cost. These two transformations are integrated, and pruning strategies are presented that significantly reduce the number of loop structures to be evaluated for subsequent transformations. An evaluation of the framework using representative contraction expressions from quantum chemistry shows a dramatic reduction in the size of the search space under the strategies presented.
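
To make the transformations named in the abstract concrete, the following is a minimal sketch of loop fusion combined with array contraction on a two-step tensor contraction. It is not the paper's framework or its pruning algorithm. The function names, the example expression R[i][l] = sum_j (sum_k A[i][k] * B[k][j]) * C[j][l], and the use of "number of intermediate elements" as a stand-in for storage cost are all assumptions made for illustration.

```python
# Illustrative sketch only; names and the cost measure are hypothetical,
# not taken from the paper.

def unfused(A, B, C):
    """Produce the whole intermediate T (Ni x Nj), then consume it."""
    ni, nk, nj, nl = len(A), len(B), len(C), len(C[0])
    # Producer loop nest: T[i][j] = sum_k A[i][k] * B[k][j]
    T = [[sum(A[i][k] * B[k][j] for k in range(nk)) for j in range(nj)]
         for i in range(ni)]
    # Consumer loop nest: R[i][l] = sum_j T[i][j] * C[j][l]
    R = [[sum(T[i][j] * C[j][l] for j in range(nj)) for l in range(nl)]
         for i in range(ni)]
    return R, ni * nj  # result, intermediate elements held at once


def fused_i(A, B, C):
    """Fuse producer and consumer over i; T contracts to a single row."""
    ni, nk, nj, nl = len(A), len(B), len(C), len(C[0])
    R = []
    for i in range(ni):
        # Only one row of the intermediate is live inside the fused i loop.
        t_row = [sum(A[i][k] * B[k][j] for k in range(nk)) for j in range(nj)]
        R.append([sum(t_row[j] * C[j][l] for j in range(nj))
                  for l in range(nl)])
    return R, nj  # result, intermediate elements held at once
```

Both functions compute the same result, but the fused version keeps only Nj intermediate elements instead of Ni * Nj. When the intermediate would otherwise be too large for main memory, that kind of reduction is what removes disk traffic, and each choice of which loops to fuse yields a different loop structure that a search over fusion and tiling would have to consider.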