We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs, including hierarchical, mixed-radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed-radix FFTs of small primes and Bluestein's algorithm, and we use modular arithmetic in Bluestein's algorithm to improve its accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and with an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2x to 4x over CUFFT and 8x to 40x over MKL for large sizes.
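To illustrate the role modular arithmetic can play in Bluestein's algorithm, here is a minimal CPU sketch in pure Python. It is not the CUDA implementation described above. It assumes one common form of the technique: the chirp exponent n^2 is reduced modulo 2N in exact integer arithmetic before conversion to floating point, which keeps the argument of the complex exponential small. Whether this matches the paper's exact scheme is an assumption. The names `fft_pow2`, `chirp`, `bluestein_dft`, and `naive_dft` are hypothetical helpers introduced for this example.

```python
import cmath


def fft_pow2(x, inverse=False):
    """Recursive radix-2 FFT for power-of-two lengths (unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1.0 if inverse else -1.0
    even = fft_pow2(x[0::2], inverse)
    odd = fft_pow2(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def chirp(n, size):
    """Return exp(-i*pi*n^2/size).

    The exponent n^2 is reduced modulo 2*size with exact integers first;
    the chirp is periodic in n^2 with period 2*size, so the value is
    unchanged while the floating-point argument stays below 2*pi.
    """
    r = (n * n) % (2 * size)
    return cmath.exp(-1j * cmath.pi * r / size)


def bluestein_dft(x):
    """DFT of arbitrary length via Bluestein's chirp-z convolution."""
    size = len(x)
    m = 1
    while m < 2 * size - 1:  # power-of-two length for the circular convolution
        m *= 2
    w = [chirp(n, size) for n in range(size)]
    # a[n] = x[n] * w[n], zero-padded to length m
    a = [x[n] * w[n] for n in range(size)] + [0j] * (m - size)
    # b[j] = conj(w[|j|]) for j in -(size-1)..(size-1), wrapped circularly
    b = [0j] * m
    b[0] = w[0].conjugate()
    for n in range(1, size):
        b[n] = b[m - n] = w[n].conjugate()
    fa = fft_pow2(a)
    fb = fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], inverse=True)
    # Divide by m to normalize the inverse FFT, then apply the output chirp
    return [w[k] * conv[k] / m for k in range(size)]


def naive_dft(x):
    """O(N^2) reference DFT used only to check the result."""
    size = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / size) for n in range(size))
        for k in range(size)
    ]
```

Calling `bluestein_dft([1, 2, 3, 4, 5])` should agree with `naive_dft` on the same input to within rounding error. The benefit of the modular reduction is expected to be larger in single precision, as typically used on GPUs, where n^2 for large n would otherwise lose low-order bits before the trigonometric evaluation.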