Computational frameworks for the fast Fourier transform
Computational frameworks for the fast Fourier transform
A fast Fourier transform compiler
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
A memory model for scientific algorithms on graphics processors
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
High performance discrete Fourier transforms on graphics processors
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Bandwidth intensive 3-D FFT kernel for GPUs using CUDA
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Auto-tuning 3-D FFT library for CUDA GPUs
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator
ICPPW '09 Proceedings of the 2009 International Conference on Parallel Processing Workshops
Using GPUs to compute large out-of-card FFTs
Proceedings of the international conference on Supercomputing
A GPU-based high-throughput image retrieval algorithm
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Parallelizing SOR for GPGPUs using alternate loop tiling
Parallel Computing
On the communication complexity of 3D FFTs and its implications for Exascale
Proceedings of the 26th ACM international conference on Supercomputing
A transpose-free in-place SIMD optimized FFT
ACM Transactions on Architecture and Code Optimization (TACO)
Computational physics on graphics processing units
PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Hi-index | 0.00 |
In this paper, a Cooley-Tukey algorithm based multidimensional FFT computation framework on GPU is proposed. This framework generalizes the decomposition of multi-dimensional FFT on GPUs using an I/O tensor representation, and therefore provides a systematic description of possible FFT implementations on GPUs. The framework is geared to the efficiency of multi-dimensional FFT on GPU architectures. In particular, no global transposition among dimensions is performed and some previously unnoticed grouping and commutability of multiple dimensions are highlighted in order to reduce the number of computational kernels and minimize the number of global memory accesses. Important architectural factors and constraints of CUDA, such as coalesced access, bank conflicts and register pressure are also considered in this framework. Moreover, we adapt codelets, a straight-line style FFT implementation originally developed in FFTW, into our framework and prove that they are highly efficient on GPUs. A 2D and 3D FFT library, currently supporting power-of-two sizes, is implemented on this framework and empirically-tuned results are compared with CUFFT and other recent publications on three NVIDIA GPUs. On a high-end NVIDIA GPU, GeForce GTX280, our 2D implementation is 2.8x faster than CUFFT and 1.6x faster than the best previously published results on average. Our 3D FFT implementation achieves 22.7x speed up over CUFFT on average. Furthermore both implementations show better precision than CUFFT. This library and its framework are potentially extensible to more general FFT problem sizes and other parallel architectures as well.