Auto-tuning of fast Fourier transform on graphics processors

Authors:
Yuri Dotsenko;Sara S. Baghsorkhi;Brandon Lloyd;Naga K. Govindaraju
Affiliations:
Microsoft Corporation, Redmond, WA, USA;University of Illinois at Urbana-Champaign, Urbana, IL, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA
Venue:
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Year:
2011

Abstract

We present an auto-tuning framework for FFTs on graphics processors (GPUs). Because of the complex design of GPU memory and compute subsystems, the performance of FFT kernels can vary widely over the range of possible input parameters. For each component of the FFT kernel, we generate several variants that are likely to perform well in different cases. Our auto-tuner composes these variants into kernels and selects the best ones. We present heuristics that prune the search space so that only a small fraction of all possible kernels is profiled. We then compose the optimized kernels to improve the performance of larger FFT computations. We implement the system using the NVIDIA CUDA API and compare its performance with state-of-the-art FFT libraries. On a range of NVIDIA GPUs and input sizes, our auto-tuned FFTs outperform the NVIDIA CUFFT 3.0 library by up to 38x and deliver up to 3x higher performance than a manually tuned FFT.
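The compose, prune, profile, and select loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the component names (`RADICES`, `THREADS_PER_BLOCK`, `USE_SHARED_MEMORY`), the register-budget heuristic in `keep`, and the `toy_cost` model are all hypothetical stand-ins chosen so that the example runs without a GPU. A real tuner would replace `toy_cost` with an actual timing run of the generated kernel.

```python
import itertools

# Hypothetical variant choices for each kernel component (not from the paper).
RADICES = [2, 4, 8, 16]
THREADS_PER_BLOCK = [64, 128, 256]
USE_SHARED_MEMORY = [False, True]


def all_candidates():
    """Compose one variant per component into candidate configurations."""
    return list(itertools.product(RADICES, THREADS_PER_BLOCK, USE_SHARED_MEMORY))


def keep(candidate, n, max_registers=2048):
    """Illustrative pruning heuristic (an assumption, not the paper's rule)."""
    radix, threads, _shared = candidate
    # The radix must evenly divide the transform size.
    if n % radix != 0:
        return False
    # Assume each thread holds `radix` complex values, 2 registers each,
    # and that a block may not exceed a made-up register budget.
    return threads * radix * 2 <= max_registers


def toy_cost(candidate, n):
    """Stand-in for a real timing run: a made-up cost model, not measured data."""
    radix, threads, shared = candidate
    # Count the passes needed to cover n points with the given radix.
    passes, size = 0, 1
    while size < n:
        size *= radix
        passes += 1
    return passes * 10 + 1000 / threads + (0 if shared else 5)


def autotune(n, measure):
    """Profile only the candidates that survive pruning.

    Returns the fastest candidate and the number of candidates profiled.
    """
    survivors = [c for c in all_candidates() if keep(c, n)]
    best = min(survivors, key=lambda c: measure(c, n))
    return best, len(survivors)
```

Calling `autotune(1024, toy_cost)` under these assumptions profiles 18 of the 24 composed candidates and returns the configuration with the lowest modeled cost. The point of the sketch is only the structure: pruning runs before any measurement, so the expensive profiling step touches a reduced candidate set.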