A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Unifying data and control transformations for distributed shared-memory machines
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Global arrays: a nonuniform memory access programming model for high-performance computers
The Journal of Supercomputing
Generalized Cannon's algorithm for parallel matrix multiplication
ICS '97 Proceedings of the 11th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Improving locality using loop and data transformations in an integrated framework
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts
IEEE Transactions on Parallel and Distributed Systems
Nonsingular Data Transformations: Definition, Validity, and Applications
International Journal of Parallel Programming
Static and Dynamic Locality Optimizations Using Integer Linear Programming
IEEE Transactions on Parallel and Distributed Systems
Space-time trade-off optimization for a class of electronic structure calculations
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Reducing and Vectorizing Procedures for Telescoping Languages
International Journal of Parallel Programming
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Reduction of Cache Coherence Overhead by Compiler Data Layout and Loop Transformation
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
A Feasibility Study in Iterative Compilation
ISHPC '99 Proceedings of the Second International Symposium on High Performance Computing
Cache Models for Iterative Compilation
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
A comparison of empirical and model-driven optimization
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Global Communication Optimization for Tensor Contraction Expressions under Memory Constraints
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
A cellular computer to implement the kalman filter algorithm
A cellular computer to implement the kalman filter algorithm
Performance optimization of a class of loops implementing multidimensional integrals
Performance optimization of a class of loops implementing multidimensional integrals
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy
Proceedings of the international symposium on Code generation and optimization
Automatic Type-Driven Library Generation for Telescoping Languages
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Concurrency and Computation: Practice & Experience - Compilers for Parallel Computers
Performance modeling and optimization of parallel out-of-core tensor contractions
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Tuning High Performance Kernels through Empirical Compilation
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Sparsity: Optimization Framework for Sparse Matrix Kernels
International Journal of High Performance Computing Applications
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
International Journal of High Performance Computing Applications
High Performance Remote Memory Access Communication: The Armci Approach
International Journal of High Performance Computing Applications
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Combining analytical and empirical approaches in tuning matrix transposition
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Efficient search-space pruning for integrated fusion and tiling transformations: Research Articles
Concurrency and Computation: Practice & Experience - Current Trends in Compilers for Parallel Computers (CPC2006)
Integrated compiler optimizations for tensor contractions
Integrated compiler optimizations for tensor contractions
Improving performance of optimized kernels through fast instantiations of templates
Concurrency and Computation: Practice & Experience - Compilers for Parallel Computers 2007 Workshop (CPC 2007)
Compositional approach applied to loop specialization
Concurrency and Computation: Practice & Experience - Compilers for Parallel Computers 2007 Workshop (CPC 2007)
Data layout optimization techniques for modern and emerging architectures
Data layout optimization techniques for modern and emerging architectures
A scalable auto-tuning framework for compiler optimization
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Automatic creation of tile size selection models
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Memory-optimal evaluation of expression trees involving large objects
Computer Languages, Systems and Structures
Deciding where to call performance libraries
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Evaluating iterative compilation
LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Hi-index | 0.00 |
Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined in ATLAS by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to select library calls and choose data layout transformations in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on representative computations from quantum chemistry.