An empirically tuned 2D and 3D FFT library on CUDA GPU

  • Authors: Liang Gu; Xiaoming Li; Jakob Siegel
  • Affiliations: University of Delaware, Newark, DE (all authors)
  • Venue: Proceedings of the 24th ACM International Conference on Supercomputing
  • Year: 2010

Abstract

In this paper, we propose a multidimensional FFT computation framework for GPUs based on the Cooley-Tukey algorithm. The framework generalizes the decomposition of multidimensional FFTs on GPUs using an I/O tensor representation, and therefore provides a systematic description of possible FFT implementations on GPUs. It is geared toward the efficiency of multidimensional FFTs on GPU architectures. In particular, no global transposition among dimensions is performed, and some previously unnoticed grouping and commutability of multiple dimensions are highlighted in order to reduce the number of computational kernels and minimize the number of global memory accesses. Important architectural factors and constraints of CUDA, such as coalesced access, bank conflicts, and register pressure, are also considered in this framework. Moreover, we adapt codelets, a straight-line style of FFT implementation originally developed in FFTW, into our framework and show that they are highly efficient on GPUs. A 2D and 3D FFT library, currently supporting power-of-two sizes, is implemented on this framework, and its empirically tuned results are compared with CUFFT and other recent publications on three NVIDIA GPUs. On a high-end NVIDIA GPU, the GeForce GTX280, our 2D implementation is on average 2.8x faster than CUFFT and 1.6x faster than the best previously published results. Our 3D FFT implementation achieves a 22.7x speedup over CUFFT on average. Furthermore, both implementations show better precision than CUFFT. The library and its framework are potentially extensible to more general FFT problem sizes and to other parallel architectures.
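
To make the decomposition described in the abstract concrete, the following is a minimal CPU-side sketch, not the paper's GPU framework. It shows a radix-2 Cooley-Tukey FFT for power-of-two lengths and a 2D FFT built from 1D FFTs along each dimension, reading columns with strided indexing rather than transposing the whole array. The function names `fft_pow2` and `fft2_pow2` are hypothetical and chosen only for illustration; the sketch does not model kernels, coalescing, bank conflicts, or codelets.

```python
import cmath


def fft_pow2(x):
    """Recursive radix-2 Cooley-Tukey FFT (illustrative, not tuned).

    The input length must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a nonzero power of two")
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed halves and transform each.
    even = fft_pow2(x[0::2])
    odd = fft_pow2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor exp(-2*pi*i*k/n) applied to the odd half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2_pow2(a):
    """2D FFT as 1D FFTs along rows, then along columns.

    Columns are gathered with strided reads; the array is never
    transposed as a whole (a loose analogy to the paper's
    no-global-transposition design, not its actual method).
    """
    rows, cols = len(a), len(a[0])
    tmp = [fft_pow2(row) for row in a]
    out = [[0j] * cols for _ in range(rows)]
    for c in range(cols):
        col = fft_pow2([tmp[r][c] for r in range(rows)])
        for r in range(rows):
            out[r][c] = col[r]
    return out
```

For example, calling `fft2_pow2` on a 4x4 array of ones yields a spectrum whose only nonzero entry is the DC term at index `[0][0]`. Because the two passes act on different dimensions, they commute: transforming columns first and rows second gives the same result, which is the kind of freedom a tuned implementation can exploit when ordering and grouping dimensions.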