We present an auto-tuning framework for FFTs on graphics processors (GPUs). Because of the complex design of the memory and compute subsystems on GPUs, the performance of an FFT kernel can vary widely across the range of possible input parameters. For each component of the FFT kernel, we generate several variants, each of which is likely to perform well for some subset of cases. Our auto-tuner composes these variants into complete kernels and selects the best-performing ones. We present heuristics that prune the search space so that only a small fraction of all possible kernels needs to be profiled. We then compose the optimized kernels to improve the performance of larger FFT computations. We implement the system using the NVIDIA CUDA API and compare its performance to state-of-the-art FFT libraries. On a range of NVIDIA GPUs and input sizes, our auto-tuned FFTs outperform the NVIDIA CUFFT 3.0 library by up to 38x and deliver up to 3x higher performance than a manually tuned FFT.
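The generate, prune, profile, and select loop described above can be illustrated with a small CPU-only sketch. It is not the paper's CUDA implementation: here a "variant" is simply an ordered radix decomposition of a mixed-radix Cooley-Tukey FFT, and the function names (`candidate_plans`, `prune`, `autotune`), the radix set (2, 4, 8), and the pruning rule (keep only plans with close to the fewest passes) are all assumptions made for illustration.

```python
import cmath
import time

def dft(x):
    """Naive O(r^2) DFT, used as the small fixed-radix building block."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_plan(x, plan):
    """Decimation-in-time FFT; len(x) must equal the product of `plan`."""
    n = len(x)
    if len(plan) == 1:
        return dft(x)
    r, m = plan[0], n // plan[0]
    # r interleaved sub-transforms of length m, each using the rest of the plan
    subs = [fft_plan(x[s::r], plan[1:]) for s in range(r)]
    out = [0j] * n
    for k in range(m):
        # apply twiddle factors, then a radix-r DFT across the sub-results
        t = [subs[s][k] * cmath.exp(-2j * cmath.pi * s * k / n) for s in range(r)]
        for q, v in enumerate(dft(t)):
            out[k + q * m] = v
    return out

def candidate_plans(n, radices=(2, 4, 8)):
    """Generate every ordered radix sequence whose product is n."""
    if n == 1:
        return [()]
    plans = []
    for r in radices:
        if n % r == 0:
            plans += [(r,) + rest for rest in candidate_plans(n // r, radices)]
    return plans

def prune(plans, slack=1):
    """Hypothetical heuristic: keep plans within `slack` passes of the fewest."""
    fewest = min(len(p) for p in plans)
    return [p for p in plans if len(p) <= fewest + slack]

def autotune(n, repeats=3):
    """Profile only the pruned plans and return (best_plan, best_seconds)."""
    x = [complex(i % 7, -(i % 3)) for i in range(n)]
    best_plan, best_time = None, float("inf")
    for plan in prune(candidate_plans(n)):
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fft_plan(x, plan)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_plan, best_time = plan, elapsed
    return best_plan, best_time
```

Calling `autotune(64)` returns the fastest of the profiled plans on the current machine together with its best observed time. Which plan wins depends on the hardware and interpreter, which is the reason for measuring rather than fixing a plan in advance.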