Tuning High Performance Kernels through Empirical Compilation

Authors:
David B. Whalley
Affiliations:
Florida State University
Venue:
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Year:
2005

Citing 0
Cited 15

Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

Combining analytical and empirical approaches in tuning matrix transposition
Proceedings of the 15th international conference on Parallel architectures and compilation techniques

Achieving accurate and context-sensitive timing for code optimization
Software—Practice & Experience

Exploring the Optimization Space of Dense Linear Algebra Kernels
Languages and Compilers for Parallel Computing

Automated transformation for performance-critical kernels
LCSD '07 Proceedings of the 2007 Symposium on Library-Centric Software Design

Parametric multi-level tiling of imperfectly nested loops
Proceedings of the 23rd international conference on Supercomputing

Automating the generation of composed linear algebra kernels
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

An adaptive performance modeling tool for GPU architectures
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Scaling LAPACK panel operations using parallel cache assignment
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

AARTS: low overhead online adaptive auto-tuning
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era

Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions
Journal of Parallel and Distributed Computing

Loop transformation recipes for code generation and auto-tuning
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing

Effective source-to-source outlining to support whole program empirical optimization
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing

Parameterized micro-benchmarking: an auto-tuning approach for complex applications
Proceedings of the 9th conference on Computing Frontiers

Vectorization past dependent branches through speculation
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.03

Abstract

There are a few application areas which remain almost untouched by the historical and continuing advancement of compilation research. For the extremes of optimization required for high performance computing on one end, and embedded systems at the opposite end of the spectrum, many critical routines are still hand-tuned, often directly in assembly. At the same time, architecture implementations are performing an increasing number of compiler-like transformations in hardware, making it harder to predict the performance impact of a given series of optimizations applied at the ISA level. These issues, together with the rate of hardware evolution dictated by Moore's Law, make it almost impossible to keep key kernels running at peak efficiency. Automated empirical systems, where direct timings are used to guide optimization, have provided the most successful response to these challenges. This paper describes our approach to performing empirical optimization, which utilizes a low-level iterative compilation framework specialized for optimizing high performance computing kernels. We present results showing that this approach can not only provide speedups over traditional optimizing compilers, but can improve overall performance when compared to the best hand-tuned kernels selected by the empirical search of our well-known ATLAS package.
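
The core idea of empirical optimization, letting measured timings rather than a static model choose among candidate code variants, can be illustrated with a short sketch. This is not the paper's framework, which operates at a low level on compiled kernels; it is a generic exhaustive search written for illustration only. The names `empirical_search`, `time_once`, and `make_blocked_sum`, along with the block-size parameter, are assumptions introduced here and do not come from the paper or from ATLAS.

```python
import time


def time_once(fn):
    """Return the wall-clock seconds taken by a single call of fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def empirical_search(candidates, build, measure, repeats=3):
    """Pick the candidate whose built variant has the lowest measured cost.

    candidates: iterable of tuning parameter values to try
    build:      maps a parameter value to a callable kernel variant
    measure:    maps a kernel variant to a cost (seconds, for example)
    repeats:    measurements per variant; the minimum is kept to damp noise
    """
    best_param, best_cost = None, float("inf")
    for param in candidates:
        kernel = build(param)
        cost = min(measure(kernel) for _ in range(repeats))
        if cost < best_cost:
            best_param, best_cost = param, cost
    return best_param, best_cost


def make_blocked_sum(block):
    """Hypothetical kernel family: sum a fixed list in chunks of `block`."""
    data = list(range(100_000))

    def kernel():
        total = 0
        for i in range(0, len(data), block):
            total += sum(data[i:i + block])
        return total

    return kernel


if __name__ == "__main__":
    best, secs = empirical_search([1, 8, 64, 512], make_blocked_sum, time_once)
    print(f"fastest block size: {best} ({secs:.6f} s)")
```

The measurement function is passed in as a parameter so that the search logic is separate from the timer. A real tuner would also need to control for cache state and system noise, which this sketch reduces only by taking the minimum of a few repeated timings.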