Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW

Authors:
Rich Vuduc;James Demmel
Venue:
SAIG '00 Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation
Year:
2000

Abstract

Achieving peak performance in important numerical kernels such as dense matrix multiply or sparse matrix-vector multiplication usually requires extensive, machine-dependent tuning by hand. In response, a number of automatic tuning systems have been developed that typically operate by (1) generating multiple implementations of a kernel, and (2) empirically selecting an optimal implementation. One such system is FFTW (Fastest Fourier Transform in the West) for the discrete Fourier transform. In this paper, we review FFTW's inner workings with an emphasis on its code generator, and report on our empirical evaluation of the system on two different hardware and compiler platforms. We then describe a number of our own extensions to the FFTW code generator that compute efficient discrete cosine transforms and show promising speed-ups over a vendor-tuned library. We also comment on current opportunities to develop tuning systems in the spirit of FFTW for other widely used kernels.
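
The two-step pattern described in the abstract (generate several implementations of a kernel, then pick one empirically) can be illustrated with a minimal Python sketch. This is not FFTW's code generator or planner. The two candidate DFT routines are written by hand rather than generated, and the names `dft_naive`, `fft_radix2`, `select_fastest`, and `CANDIDATES` are hypothetical, introduced only for this illustration.

```python
import cmath
import timeit


def dft_naive(x):
    """Direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def select_fastest(candidates, x, repeats=3):
    """Time every candidate on input x; return (name, seconds) of the fastest."""
    timings = {}
    for name, fn in candidates.items():
        # Best-of-N wall-clock time, one call per measurement.
        timings[name] = min(timeit.repeat(lambda: fn(x), number=1, repeat=repeats))
    best = min(timings, key=timings.get)
    return best, timings[best]


# Hypothetical candidate set standing in for step (1), "generate multiple
# implementations"; a real tuning system would emit these automatically.
CANDIDATES = {"naive": dft_naive, "radix2": fft_radix2}

if __name__ == "__main__":
    signal = [complex(i % 7, 0) for i in range(256)]
    name, seconds = select_fastest(CANDIDATES, signal)
    print(f"selected {name} ({seconds:.6f} s)")
```

The selection is made by measurement on the target machine rather than by a cost model, which is the essential point of step (2). Which candidate wins can therefore differ across input sizes, hardware, and compilers.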