Speeding up Nek5000 with autotuning and specialization

  • Authors:
  • Jaewook Shin;Mary W. Hall;Jacqueline Chame;Chun Chen;Paul F. Fischer;Paul D. Hovland

  • Affiliations:
  • Argonne National Laboratory, Argonne, IL;University of Utah, Salt Lake City, UT;USC / ISI, Marina del Rey, CA;University of Utah, Salt Lake City, UT;Argonne National Laboratory, Argonne, IL;Argonne National Laboratory, Argonne, IL

  • Venue:
  • Proceedings of the 24th ACM International Conference on Supercomputing
  • Year:
  • 2010

Abstract

Autotuning technology has emerged recently as a systematic process for evaluating alternative implementations of a computation, in order to select the best-performing solution for a particular architecture. Specialization customizes and optimizes code for a particular class of input data sets. In this paper, we demonstrate how compiler-based autotuning that incorporates specialization for the expected data set sizes of key computations can be used to speed up Nek5000, a spectral-element code. Nek5000 makes heavy use of what are effectively Basic Linear Algebra Subprograms (BLAS) calls, but for very small matrices. Through autotuning and specialization, we can achieve significant performance gains over hand-tuned libraries (e.g., Goto, ATLAS, and ACML BLAS). Additional performance gains are obtained from using higher-level compiler optimizations that aggregate multiple BLAS calls. We demonstrate more than 2.2X performance gains on an Opteron over the original manually tuned implementation, and speedups of up to 1.26X on the entire application running on 256 nodes of the Cray XT5 Jaguar system at Oak Ridge.
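
The two ideas in the abstract, specialization to known small matrix sizes and autotuning across alternative implementations, can be illustrated with a minimal sketch. This is not the paper's compiler-based toolchain; it is a toy written in pure Python, and every name in it (`matmul_ijk`, `matmul_ikj`, `make_specialized`, `autotune`) is hypothetical. It assumes list-of-lists matrices and simply times each variant on one fixed problem size, keeping the fastest.

```python
import time


def matmul_ijk(a, b, m, k, n):
    """Generic reference variant: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_ikj(a, b, m, k, n):
    """Generic variant with interchanged loops (row-wise updates of C)."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        ci = c[i]
        for p in range(k):
            aip = a[i][p]
            bp = b[p]
            for j in range(n):
                ci[j] += aip * bp[j]
    return c


def make_specialized(m, k, n):
    """Generate a variant specialized to fixed (m, k, n).

    The sizes become constants in the generated source and the
    reduction loop over p is fully unrolled (hypothetical helper).
    """
    terms = " + ".join(f"ai[{p}] * b[{p}][j]" for p in range(k)) or "0.0"
    source = "\n".join([
        "def kernel(a, b):",
        f"    c = [[0.0] * {n} for _ in range({m})]",
        f"    for i in range({m}):",
        "        ai = a[i]",
        f"        for j in range({n}):",
        f"            c[i][j] = {terms}",
        "    return c",
    ])
    namespace = {}
    exec(source, namespace)
    kernel = namespace["kernel"]
    # Wrap so that all variants share the same call signature.
    return lambda a, b, m_, k_, n_: kernel(a, b)


def autotune(m, k, n, repeats=200):
    """Time every variant on one (m, k, n) problem and return the fastest."""
    # Integer-valued inputs keep the floating-point results exact.
    a = [[float(i + p) for p in range(k)] for i in range(m)]
    b = [[float(p - j) for j in range(n)] for p in range(k)]
    variants = {
        "ijk": matmul_ijk,
        "ikj": matmul_ikj,
        "specialized": make_specialized(m, k, n),
    }
    expected = matmul_ijk(a, b, m, k, n)
    timings = {}
    for name, fn in variants.items():
        # Reject any variant that does not reproduce the reference result.
        assert fn(a, b, m, k, n) == expected
        start = time.perf_counter()
        for _ in range(repeats):
            fn(a, b, m, k, n)
        timings[name] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, variants[best], timings


if __name__ == "__main__":
    best, _, timings = autotune(8, 8, 8)
    print("fastest variant for 8x8x8:", best)
    for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
        print(f"  {name}: {seconds:.6f} s")
```

Which variant wins depends on the machine and the interpreter, which is the point of searching empirically rather than fixing one implementation in advance. A real system along the lines the abstract describes would generate and time compiled code variants, not Python functions.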