Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

Authors:
Akira Nukada;Yasuhiko Ogata;Toshio Endo;Satoshi Matsuoka
Affiliations:
Tokyo Institute of Technology, Tokyo, Japan and Japan Science and Technology Agency, Kawaguchi, Saitama, Japan;Tokyo Institute of Technology, Tokyo, Japan and Japan Science and Technology Agency, Kawaguchi, Saitama, Japan;Tokyo Institute of Technology, Tokyo, Japan and Japan Science and Technology Agency, Kawaguchi, Saitama, Japan;Tokyo Institute of Technology, Tokyo, Japan and National Institute of Informatics, Tokyo, Japan and Japan Science and Technology Agency, Kawaguchi, Saitama, Japan
Venue:
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Year:
2008

Abstract

Most GPU performance "hype" has focused on tightly coupled applications with small memory bandwidth requirements, e.g., N-body. GPUs, however, are also commodity vector machines with substantial memory bandwidth, yet effective programming methodologies for exploiting that bandwidth have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, more than three times faster than any existing FFT implementation on GPUs, including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations; these include utilizing on-chip shared memory, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed strided memory accesses. Applied to real applications, our kernel achieves orders-of-magnitude improvements in power and cost versus performance metrics. The off-card bandwidth limitation remains an issue; it could be alleviated somewhat by confining application kernels within the card, while the ideal solution is the facilitation of faster GPU interfaces.
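
To make the strided-access concern in the abstract concrete, the following is a minimal sketch in plain Python, not taken from the paper. It assumes a row-major 3-D array indexed [z][y][x] of 8-byte single-precision complex values and an illustrative 256x256x256 problem size; the abstract states neither. The function names axis_strides_bytes and fft3d_flops are hypothetical. The flop count uses the common 5 N log2 N convention for complex FFTs, which is an assumption about how the GFLOPS figure would be computed.

```python
import math


def axis_strides_bytes(nx, ny, nz, elem_bytes=8):
    """Byte distance between consecutive elements along each axis.

    Assumes a row-major array indexed [z][y][x], so x varies fastest.
    elem_bytes=8 assumes single-precision complex values (an assumption).
    nz is accepted for symmetry; it does not affect any stride.
    """
    return {
        "x": elem_bytes,            # contiguous: neighbours are adjacent
        "y": nx * elem_bytes,       # one full row apart
        "z": nx * ny * elem_bytes,  # one full plane apart
    }


def fft3d_flops(nx, ny, nz):
    """Nominal flop count for a complex 3-D FFT (5 N log2 N convention)."""
    n = nx * ny * nz
    return 5.0 * n * math.log2(n)


if __name__ == "__main__":
    size = 256  # illustrative size, not stated in the abstract
    for axis, stride in axis_strides_bytes(size, size, size).items():
        print(f"1-D transforms along {axis}: stride {stride} bytes")
    print(f"nominal flops: {fft3d_flops(size, size, size):.0f}")
```

Under these assumptions, only the transforms along x read adjacent elements. The transforms along y and z step by 2 KiB and 512 KiB respectively, which is the kind of access pattern the abstract describes as low-speed.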