High performance discrete Fourier transforms on graphics processors

Authors:
Naga K. Govindaraju;Brandon Lloyd;Yuri Dotsenko;Burton Smith;John Manferdelli
Affiliations:
Microsoft Corporation;Microsoft Corporation;Microsoft Corporation;Microsoft Corporation;Microsoft Corporation
Venue:
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Year:
2008

Abstract

We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs: hierarchical, mixed-radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed-radix FFTs of small primes and Bluestein's algorithm, with modular arithmetic in Bluestein's algorithm to improve accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4x over CUFFT and 8-40x over MKL for large sizes.
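The abstract names two building blocks, a Stockham formulation for power-of-two sizes and Bluestein's algorithm for other sizes. They can be illustrated with a small CPU-side sketch. This is not the paper's CUDA implementation: it is a plain radix-2 version with no shared-memory blocking, mixed radices, or multi-FFT transposes, and the function names `stockham_fft` and `bluestein_dft` are hypothetical. Reducing the chirp exponent modulo 2n is one plausible reading of the "modular arithmetic" the abstract mentions, stated here as an assumption.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (forward DFT, natural-order output).

    Illustrative only: ping-pongs between two buffers instead of doing a
    bit-reversal pass. len(x) must be a power of two.
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = [complex(v) for v in x]
    b = [0j] * n
    l, m = n // 2, 1  # l: butterflies per sub-transform, m: stride
    while l >= 1:
        for j in range(l):
            w = cmath.exp(-2j * cmath.pi * j / (2 * l))  # twiddle factor
            for k in range(m):
                c0 = a[k + j * m]
                c1 = a[k + j * m + l * m]
                b[k + 2 * j * m] = c0 + c1
                b[k + 2 * j * m + m] = w * (c0 - c1)
        a, b = b, a  # output of this pass is input of the next
        l //= 2
        m *= 2
    return a


def bluestein_dft(x):
    """DFT of arbitrary length n via Bluestein's chirp-z convolution."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:  # padded power-of-two convolution length
        m *= 2
    # Chirp w[j] = exp(-i*pi*j^2/n). The chirp is periodic in j^2 with
    # period 2n, so j^2 is reduced mod 2n to keep the angle small
    # (assumed interpretation of the abstract's "modular arithmetic").
    w = [cmath.exp(-1j * cmath.pi * ((j * j) % (2 * n)) / n) for j in range(n)]
    a = [complex(x[j]) * w[j] for j in range(n)] + [0j] * (m - n)
    b = [0j] * m
    for j in range(n):
        b[j] = w[j].conjugate()
        if j:
            b[m - j] = w[j].conjugate()  # negative lags wrap around
    fa, fb = stockham_fft(a), stockham_fft(b)
    prod = [p * q for p, q in zip(fa, fb)]
    # Inverse FFT via conjugation: ifft(v) = conj(fft(conj(v))) / m
    conv = [v.conjugate() / m for v in stockham_fft([p.conjugate() for p in prod])]
    return [w[k] * conv[k] for k in range(n)]
```

Calling `stockham_fft` on a length-8 list and `bluestein_dft` on a length-7 list should both agree with a direct O(n^2) DFT to within floating-point rounding. The Bluestein path costs three power-of-two FFTs of length at least 2n-1, which is why mixed-radix kernels are generally preferable when n factors into small primes.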