Automated transformation for performance-critical kernels

Authors:
Qing Yi; R. Clint Whaley
Affiliations:
University of Texas at San Antonio; University of Texas at San Antonio
Venue:
LCSD '07 Proceedings of the 2007 Symposium on Library-Centric Software Design
Year:
2007

Abstract

The performance of many scientific applications depends on a small number of key computational kernels that require a level of efficiency rarely delivered by existing native compilers. We present a new approach to high-performance kernel optimization in which a general-purpose transformation engine automates the production of highly efficient library routines. The generated routines are then empirically tested until an implementation with a satisfactory performance level is found. Our framework takes an annotated kernel specification as input and automatically produces optimized implementations based on tuning parameters controlled by a search driver. The transformation engine includes an extensive suite of optimizations that can be easily extended using a custom transformation language. We have applied our framework to generate code for key linear algebra kernels and have achieved performance comparable to that of ATLAS's highly tuned kernels. In several cases, our kernels were faster than ATLAS's native kernels; we have made these kernels available to ATLAS, and we show that they yield speedups for the ATLAS library.
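
The generate-and-test loop described in the abstract can be illustrated with a small sketch. This is not the authors' framework or its transformation language; it is a minimal Python illustration under stated assumptions. A closure factory (`make_blocked_matvec`, a hypothetical name) stands in for the transformation engine, a single block-size parameter stands in for the tuning parameters, and an exhaustive timing loop (`search`, also hypothetical) stands in for the search driver.

```python
import timeit


def make_blocked_matvec(block):
    """Stand-in for a transformation engine: return a matrix-vector
    product variant that walks the columns in chunks of `block`."""
    def matvec(a, x):
        n = len(x)
        y = [0.0] * len(a)
        for j0 in range(0, n, block):
            j1 = min(j0 + block, n)
            for i, row in enumerate(a):
                acc = 0.0
                for j in range(j0, j1):
                    acc += row[j] * x[j]
                y[i] += acc
        return y
    return matvec


def reference_matvec(a, x):
    """Untransformed kernel used to check each generated variant."""
    return [sum(r * v for r, v in zip(row, x)) for row in a]


def search(candidates, a, x, repeats=3):
    """Stand-in for a search driver: time every candidate block size
    and return (best_block, timings)."""
    expected = reference_matvec(a, x)
    timings = {}
    for block in candidates:
        kernel = make_blocked_matvec(block)
        # Reject any variant that does not reproduce the reference result.
        got = kernel(a, x)
        assert all(abs(g - e) <= 1e-9 * max(1.0, abs(e))
                   for g, e in zip(got, expected))
        # Keep the best of several runs to reduce timing noise.
        timings[block] = min(
            timeit.repeat(lambda: kernel(a, x), number=1, repeat=repeats)
        )
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    a = [[float((i + j) % 7) for j in range(256)] for i in range(128)]
    x = [float(j % 5) for j in range(256)]
    best, timings = search([8, 32, 128, 256], a, x)
    print("fastest block size on this machine:", best)
```

The winning block size depends on the machine and on the input size, which is the reason for measuring rather than predicting. A practical search driver would typically prune the parameter space instead of timing every candidate.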