A GPU cluster is a cluster whose nodes are equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (e.g., matrix multiplication and LINPACK) and for bandwidth-intensive tasks with data locality (e.g., finite-difference simulation). Bandwidth-intensive tasks without data locality, such as large-scale FFTs, are harder to accelerate: the bottleneck often lies in the PCIe bus between main memory and GPU device memory, or in the communication network between nodes, so optimizing FFT performance on a single GPU device alone does not improve overall performance. This paper uses large-scale FFT as an example to show how to achieve substantial speedups for these more challenging tasks on a GPU cluster. Three GPU-related factors lead to better performance: first, GPU devices improve the sustained memory bandwidth for processing large-size data; second, GPU device memory allows larger subtasks to be processed in whole, reducing repeated data transfers between memory and processors; and finally, some costly main-memory operations, such as matrix transposition, can be significantly sped up by GPUs if the necessary data adjustment is performed during data transfers. This technique of manipulating array dimensions during data transfer is the main technical contribution of this paper. These factors (as well as the improved communication library in our implementation) contribute to a 24.3x speedup with respect to FFTW and a 7x speedup with respect to Intel MKL for a 4096³ single-precision 3-D FFT on a 16-node cluster with 32 GPUs. Around 5x speedup with respect to both standard libraries is achieved in double precision.
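To see why data transposition dominates large-scale FFT and why folding it into data movement matters, it helps to look at the classic four-step decomposition that such distributed FFTs build on. The sketch below is an illustration of that general technique, not code from the paper: a length-N transform (N = N1·N2) becomes N2 column transforms of size N1, a twiddle multiplication, N1 row transforms of size N2, and a transpose on output; on a cluster the transpose step is exactly what forces bulk data transfer over PCIe and the network. A naive `dft` stands in for the sub-transforms to keep the example self-contained.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for the small sub-transforms."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT via the four-step (transpose-based) decomposition."""
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1 x n2 matrix stored row-major: A[j1][j2].
    rows = [x[i * n2:(i + 1) * n2] for i in range(n1)]
    # Step 1: n2 column transforms of size n1 -> B[k1][j2] held column-wise.
    cols = [dft([rows[j1][j2] for j1 in range(n1)]) for j2 in range(n2)]
    # Step 2: twiddle multiplication B[k1][j2] *= w^(j2*k1), w = exp(-2*pi*i/N).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    # Steps 3-4: n1 row transforms of size n2, then transpose on output:
    # X[k2*n1 + k1] = C[k1][k2]. In a distributed setting this gather
    # across j2 is the all-to-all data exchange that dominates runtime.
    out = [0j] * n
    for k1 in range(n1):
        row_fft = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k2 * n1 + k1] = row_fft[k2]
    return out
```

The paper's contribution targets precisely the gather/transpose step: by manipulating array dimensions while data is already in flight between main memory and GPU device memory, the explicit main-memory transposition is avoided rather than merely optimized.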