Existing implementations of FFTs on GPUs are optimized for specific transform sizes, such as powers of two, and exhibit unstable, peaky performance: they do not perform as well for the other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA alleviates this problem by generating high-performance CUDA kernels for FFTs of varying transform sizes. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first time it has been applied comprehensively to bandwidth-intensive, complex kernels such as 3-D FFTs. Optimizations aimed at bandwidth, such as selecting the number of threads and inserting padding to avoid bank conflicts in shared memory, are applied systematically. The resulting auto-tuner is fast, yields performance that beats essentially all single-processor 3-D FFT implementations to date, and exhibits stable performance irrespective of problem size or the underlying GPU hardware.
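The padding optimization mentioned above can be illustrated with a small model. The following is only a sketch, not the paper's tuner: it assumes a simplified shared memory with 16 banks where word address `a` maps to bank `a % 16`, and the names `conflict_degree` and `best_padding` are hypothetical. It searches a few candidate paddings and keeps the one with the fewest conflicts, which mimics in miniature how a tuner might choose a parameter by exhaustive search.

```python
from collections import Counter

def conflict_degree(stride, num_banks=16, threads=16):
    """Worst-case number of threads that hit the same bank when
    thread t reads word t * stride (assumed bank = address % num_banks)."""
    banks = Counter((t * stride) % num_banks for t in range(threads))
    return max(banks.values())

def best_padding(row_len, max_pad=4, num_banks=16, threads=16):
    """Try paddings 0..max_pad and return the smallest one that
    minimizes the modeled bank-conflict degree."""
    return min(
        range(max_pad + 1),
        key=lambda pad: conflict_degree(row_len + pad, num_banks, threads),
    )

if __name__ == "__main__":
    # Column access into a 16-word-wide tile: every thread lands in one bank.
    print(conflict_degree(16))
    # Padding each row by one word spreads the threads across all banks.
    print(conflict_degree(17))
    print(best_padding(16))
```

In this model a stride of 16 words serializes all 16 threads on one bank, while a stride of 17 gives each thread its own bank. A practical tuner would more likely time candidate kernels on the actual device rather than rely on an analytical model like this one.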