For scalable 3-D FFT computation on multiple GPUs, efficient all-to-all communication between GPUs is the most important factor for good performance. Implementations based on point-to-point MPI functions and CUDA memory-copy APIs typically incur very large overheads, especially for small message sizes, when performing all-to-all communication among many nodes. We propose several schemes to minimize these overheads, including the use of a lower-level InfiniBand API to effectively overlap intra-node and inter-node communication, as well as auto-tuning strategies that control the communication schedule and determine rail assignments. As a result, we achieve very good strong scalability as well as high performance, up to 4.8 TFLOPS in double precision using 256 nodes (768 GPUs) of the TSUBAME 2.0 supercomputer.
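
The overlap idea can be illustrated with a short sketch. The following is a minimal, hypothetical example in C using MPI and the CUDA runtime: it pipelines an all-to-all by staging each peer's block from device memory to pinned host memory on a CUDA stream and posting its nonblocking send as soon as the copy finishes, so the network transfer of one block overlaps the PCIe copy of the next. The paper itself drives a lower-level InfiniBand API and auto-tunes both the schedule and the rail assignments; this sketch substitutes nonblocking MPI calls and a fixed round-robin schedule, and all function and variable names here are illustrative, not the authors' code.

/*
 * Hedged sketch of a pipelined all-to-all (assumptions noted above).
 * h_send and h_recv are assumed to be pinned host buffers allocated
 * with cudaMallocHost, each holding nprocs blocks of block_bytes.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

static void pipelined_alltoall(const double *d_send, double *h_send,
                               double *h_recv, size_t block_bytes,
                               int nprocs, int rank, cudaStream_t stream,
                               MPI_Comm comm)
{
    size_t block_elems = block_bytes / sizeof(double);
    MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));

    /* Post all receives up front; they complete as peers send. */
    for (int p = 0; p < nprocs; ++p)
        MPI_Irecv(h_recv + (size_t)p * block_elems, (int)block_elems,
                  MPI_DOUBLE, p, 0, comm, &reqs[p]);

    for (int i = 0; i < nprocs; ++i) {
        /* Fixed schedule: rank r sends to (r + i) mod nprocs in step i.
         * The paper auto-tunes this schedule; here it is hard-coded. */
        int peer = (rank + i) % nprocs;

        /* Stage block for this peer from device to pinned host memory. */
        cudaMemcpyAsync(h_send + (size_t)peer * block_elems,
                        d_send + (size_t)peer * block_elems,
                        block_bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);   /* block for 'peer' is now staged */

        /* Nonblocking send returns immediately, so the network transfer
         * of this block overlaps the PCIe copy of the next block. */
        MPI_Isend(h_send + (size_t)peer * block_elems, (int)block_elems,
                  MPI_DOUBLE, peer, 0, comm, &reqs[nprocs + i]);
    }

    MPI_Waitall(2 * nprocs, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

On a multi-rail machine such as TSUBAME 2.0, the rail carrying each message would additionally be chosen by the auto-tuner rather than left to the communication runtime, which is one of the tuning parameters the abstract refers to.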