A GPU cluster is a cluster whose nodes are equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (e.g., matrix multiplication and LINPACK) and for bandwidth-intensive tasks with data locality (e.g., finite-difference simulation). Bandwidth-intensive tasks without data locality, such as large-scale FFTs, are harder to accelerate: the bottleneck often lies in the PCIe bus between main memory and GPU device memory, or in the communication network between nodes, so optimizing FFT performance on a single GPU device alone does not improve overall performance. This paper uses large-scale FFT as an example to show how to achieve substantial speedups for these more challenging tasks on a GPU cluster. Three GPU-related factors lead to better performance: first, GPU devices improve the sustained memory bandwidth for processing large-size data; second, GPU device memory allows larger subtasks to be processed in whole, reducing repeated data transfers between memory and processors; and finally, some costly main-memory operations, such as matrix transposition, can be significantly sped up by GPUs if the necessary data adjustment is performed during data transfers. This technique of manipulating array dimensions during data transfer is the main technical contribution of this paper. These factors (as well as the improved communication library in our implementation) contribute to a 24.3x speedup with respect to FFTW and a 7x speedup with respect to Intel MKL for a 4096³ single-precision 3-D FFT on a 16-node cluster with 32 GPUs. Around 5x speedup with respect to both standard libraries is achieved in double precision.
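To see why data transposition dominates large-scale FFT and why folding it into data movement matters, it helps to look at the classic four-step decomposition that such distributed FFTs build on. The sketch below is an illustration of that general technique, not code from the paper: a length-N transform (N = N1·N2) becomes N2 column transforms of size N1, a twiddle multiplication, N1 row transforms of size N2, and a transpose on output; on a cluster the transpose step is exactly what forces bulk data transfer over PCIe and the network. A naive `dft` stands in for the sub-transforms to keep the example self-contained.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for the small sub-transforms."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT via the four-step (transpose-based) decomposition."""
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1 x n2 matrix stored row-major: A[j1][j2].
    rows = [x[i * n2:(i + 1) * n2] for i in range(n1)]
    # Step 1: n2 column transforms of size n1 -> B[k1][j2] held column-wise.
    cols = [dft([rows[j1][j2] for j1 in range(n1)]) for j2 in range(n2)]
    # Step 2: twiddle multiplication B[k1][j2] *= w^(j2*k1), w = exp(-2*pi*i/N).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    # Steps 3-4: n1 row transforms of size n2, then transpose on output:
    # X[k2*n1 + k1] = C[k1][k2]. In a distributed setting this gather
    # across j2 is the all-to-all data exchange that dominates runtime.
    out = [0j] * n
    for k1 in range(n1):
        row_fft = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k2 * n1 + k1] = row_fft[k2]
    return out
```

The paper's contribution targets precisely the gather/transpose step: by manipulating array dimensions while data is already in flight between main memory and GPU device memory, the explicit main-memory transposition is avoided rather than merely optimized.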