There are a few application areas which remain almost untouched by the historical and continuing advancement of compilation research. For the extremes of optimization required for high performance computing on one end, and embedded systems at the opposite end of the spectrum, many critical routines are still hand-tuned, often directly in assembly. At the same time, architecture implementations are performing an increasing number of compiler-like transformations in hardware, making it harder to predict the performance impact of a given series of optimizations applied at the ISA level. These issues, together with the rate of hardware evolution dictated by Moore's Law, make it almost impossible to keep key kernels running at peak efficiency. Automated empirical systems, where direct timings are used to guide optimization, have provided the most successful response to these challenges. This paper describes our approach to performing empirical optimization, which utilizes a low-level iterative compilation framework specialized for optimizing high performance computing kernels. We present results showing that this approach can not only provide speedups over traditional optimizing compilers, but can improve overall performance when compared to the best hand-tuned kernels selected by the empirical search of our well-known ATLAS package.