Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Terascale spectral element algorithms and implementations
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Finding effective optimization phase sequences
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
The Fastest Fourier Transform in the West
High-Performance Matrix Multiplication Algorithms for Architectures with Hierarchical Memories
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy
Proceedings of the international symposium on Code generation and optimization
A Portable Programming Interface for Performance Evaluation on Modern Processors
International Journal of High Performance Computing Applications
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time
Proceedings of the International Symposium on Code Generation and Optimization
Model-guided empirical optimization for memory hierarchy
Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Outer-loop vectorization: revisited for short SIMD architectures
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Computer Generation of General Size Linear Transform Libraries
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
A scalable auto-tuning framework for compiler optimization
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Annotation-based empirical performance tuning using Orio
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Iterative compilation with kernel exploration
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Loop transformation recipes for code generation and auto-tuning
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A programming language interface to describe transformations and code generation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
An idiom-finding tool for increasing productivity of accelerators
Proceedings of the international conference on Supercomputing
AARTS: low overhead online adaptive auto-tuning
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
A script-based autotuning compiler system to generate high-performance CUDA code
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Towards making autotuning mainstream
International Journal of High Performance Computing Applications
Tools for machine-learning-based empirical autotuning and specialization
International Journal of High Performance Computing Applications
Towards fully automatic auto-tuning: Leveraging language features of Chapel
International Journal of High Performance Computing Applications
Autotuning has recently emerged as a systematic process for evaluating alternative implementations of a computation and selecting the best-performing one for a particular architecture. Specialization customizes code to a particular class of input data sets. In this paper, we demonstrate how compiler-based autotuning that incorporates specialization for the expected data-set sizes of key computations can speed up Nek5000, a spectral-element code. Nek5000 makes heavy use of what are effectively Basic Linear Algebra Subprograms (BLAS) calls, but on very small matrices. Through autotuning and specialization, we achieve significant performance gains over hand-tuned libraries (e.g., Goto, ATLAS, and ACML BLAS). Additional gains come from higher-level compiler optimizations that aggregate multiple BLAS calls. We demonstrate more than a 2.2x performance gain on an Opteron over the original manually tuned implementation, and speedups of up to 1.26x for the entire application running on 256 nodes of the Cray XT5 Jaguar system at Oak Ridge.