Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions

Authors:
Qingda Lu; Xiaoyang Gao; Sriram Krishnamoorthy; Gerald Baumgartner; J. Ramanujam; P. Sadayappan
Affiliations:
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA; Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA; Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA; Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, USA; Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803, USA; Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
Venue:
Journal of Parallel and Distributed Computing
Year:
2012



Abstract

Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined in ATLAS by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to select library calls and choose data layout transformations in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on representative computations from quantum chemistry.
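The idea of combining empirically measured component costs into a model of overall execution time can be illustrated with a minimal sketch. It is not the paper's actual model: the function names `measure` and `choose_plan`, the "N"/"T" layout labels, and all cost figures are hypothetical and chosen only for illustration. The sketch assumes a single contraction that maps onto one matrix-multiply library call, where each operand may first need a transposition to match the layout that the chosen call variant expects.

```python
import time


def measure(fn, repeats=3):
    """Return the best-of-N wall-clock time of fn(), in seconds.

    Hypothetical helper: shows how a constituent operation could be timed.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def choose_plan(gemm_cost, transpose_cost, current_layouts):
    """Pick the call variant with the lowest modeled total cost.

    gemm_cost: dict mapping (layout_A, layout_B) to the measured cost of the
        library call that consumes operands in those layouts.
    transpose_cost: dict mapping operand name to the measured cost of
        transposing that operand.
    current_layouts: dict mapping operand name to its current layout.
    Returns (total_cost, (layout_A, layout_B)).
    """
    best = None
    for (lay_a, lay_b), t_gemm in gemm_cost.items():
        total = t_gemm
        # Add a layout-change cost only when the operand must be transposed.
        if lay_a != current_layouts["A"]:
            total += transpose_cost["A"]
        if lay_b != current_layouts["B"]:
            total += transpose_cost["B"]
        if best is None or total < best[0]:
            best = (total, (lay_a, lay_b))
    return best


# Hypothetical measured costs in milliseconds (made-up numbers).
gemm_cost = {("N", "N"): 100, ("T", "N"): 80, ("N", "T"): 130, ("T", "T"): 110}
transpose_cost = {"A": 15, "B": 25}
current_layouts = {"A": "N", "B": "N"}

plan = choose_plan(gemm_cost, transpose_cost, current_layouts)
print(plan)  # (95, ('T', 'N'))
```

With these made-up numbers, transposing A and calling the ("T", "N") variant is modeled as cheaper (80 + 15 = 95) than calling the ("N", "N") variant directly (100). In practice the cost tables would be filled by timing the real library calls and layout transformations, for example with a helper like `measure`, rather than entered by hand.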