Using GPUs to compute large out-of-card FFTs

Authors:
Liang Gu;Jakob Siegel;Xiaoming Li
Affiliations:
University of Delaware, Newark, DE, USA;University of Delaware, Newark, DE, USA;University of Delaware, Newark, DE, USA
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011

Abstract

The optimization of Fast Fourier Transform (FFT) problems that fit into GPU memory has been studied extensively. On-card FFT libraries such as CUFFT generally achieve much better performance than their CPU counterparts, since the data transfer between CPU and GPU is usually not counted in their reported performance. This high performance, however, is limited by the GPU memory size. As the FFT problem size grows, the data transfer between system memory and GPU memory can make up a substantial part of the overall execution time. Optimizations for FFT problems that outgrow the GPU memory therefore cannot bypass the tuning of data transfer between CPU and GPU, yet no prior study has attacked this problem. This paper is the first effort to use GPUs to efficiently compute large FFTs that reside in the CPU memory of a single compute node. We study the performance of the PCI bus during the transfer of a batch of FFT subarrays and propose a blocked buffer algorithm to improve the effective bandwidth. More importantly, we propose several FFT decomposition algorithms that increase data locality, further improve PCI bus efficiency, and balance computation between kernels. By integrating these two methods, we demonstrate an out-of-card FFT optimization strategy and develop an FFT library that efficiently computes large 1D, 2D and 3D FFTs that cannot fit into the GPU's memory. On three of the latest GPUs, our large FFT library achieves much better double-precision performance than two of the most efficient CPU-based libraries, FFTW and Intel MKL. On average, our large FFTs on a single GeForce GTX480 are 46% faster than FFTW and 57% faster than MKL with multiple threads running on a four-core Intel i7 CPU. The speedups on a Tesla C2070 are 1.93x over FFTW and 2.11x over MKL. A peak performance of 21 GFLOPS is achieved for a double-precision 2D FFT of size 2048x65536 on the C2070.
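To illustrate why decomposition makes an out-of-card FFT possible at all, the following minimal sketch splits a length-N transform (N = n1 * n2) into batches of smaller transforms, each of which could in principle be shipped to the GPU separately. It uses the generic Cooley-Tukey index mapping, not the specific decomposition algorithms proposed in the paper, which the abstract does not detail. The names `dft` and `decomposed_fft` are hypothetical, and the naive `dft` merely stands in for an on-card FFT call.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; a stand-in (assumption) for an on-card FFT of a small sub-array."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def decomposed_fft(x, n1, n2):
    """Length n1*n2 DFT built only from sub-transforms of length n1 and n2.

    Generic Cooley-Tukey decomposition with input index j = j1*n2 + j2
    and output index k = k1 + n1*k2. Illustrative only.
    """
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 strided sub-transforms of length n1 (one per residue j2).
    cols = [dft([x[j1 * n2 + j2] for j1 in range(n1)]) for j2 in range(n2)]
    # Step 2: multiply element (j2, k1) by the twiddle factor exp(-2*pi*i*j2*k1/n).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    # Step 3: n1 sub-transforms of length n2, scattered to k = k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


if __name__ == "__main__":
    signal = [complex(i % 7, -(i % 3)) for i in range(32)]
    direct = dft(signal)
    split = decomposed_fft(signal, 4, 8)
    print(max(abs(a - b) for a, b in zip(direct, split)))  # small rounding error
```

Steps 1 and 3 each operate on independent batches of sub-arrays, so a batch can be sized to fit in GPU memory. The strided gather in step 1 is the kind of access pattern for which transfer efficiency over the PCI bus becomes a concern.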