Parallel implementation and scalability analysis of 3D Fast Fourier Transform using 2D domain decomposition

Authors:
Orlando Ayala;Lian-Ping Wang
Affiliations:
Department of Mechanical Engineering, 126 Spencer Laboratory, University of Delaware, Newark, DE 19716-3140, USA and Centro de Métodos Numéricos en Ingeniería, Escuela de Ingenier&# ...;Department of Mechanical Engineering, 126 Spencer Laboratory, University of Delaware, Newark, DE 19716-3140, USA
Venue:
Parallel Computing
Year:
2013

Citing 11

The parallel Fourier pseudospectral method
Journal of Computational Physics

Implementation of parallel FFT algorithms on distributed memory machines with a minimum overhead of communication
Parallel Computing

Ab Initio Electronic Structure Methods in Parallel Computers
PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems

A Blocking Algorithm for Parallel 1-D FFT on Clusters of PCs
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing

A new parallel strategy for two-dimensional incompressible flow simulations using pseudo-spectral methods
Journal of Computational Physics

A low-cost parallel implementation of direct numerical simulation of wall turbulence
Journal of Computational Physics

A massively parallel implementation of the common azimuth pre-stack depth migration
IBM Journal of Research and Development

Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: implementation and early performance measurements
IBM Journal of Research and Development

An implementation of parallel 3-D FFT with 2-D decomposition on a massively parallel cluster of multi-core processors
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I

Performance measurements of the 3D FFT on the Blue Gene/L supercomputer
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

A Modified Split-Radix FFT With Fewer Arithmetic Operations
IEEE Transactions on Signal Processing

Cited 2

Designing and auto-tuning parallel 3-D FFT for computation-communication overlap
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Study of forced turbulence and its modulation by finite-size solid particles using the lattice Boltzmann approach
Computers & Mathematics with Applications

Abstract

3D FFT is computationally intensive and at the same time requires global or collective communication patterns. The efficient implementation of FFT on extreme scale computers is one of the grand challenges in scientific computing. On parallel computers with a distributed memory, different domain decompositions are possible to scale 3D FFT computation. In this paper, we argue that 2D domain decomposition is likely the best approach in terms of using a very large number of processors with reasonable data communication overhead. Specifically, we extend the data communication approach of Dmitruk et al. (2001) [21], previously used for 1D domain decomposition, to 2D domain decomposition. A thorough quantitative analysis of the code performance is undertaken for different problem sizes and numbers of processors, including scalability, load balance, and dependence on subdomain configuration (i.e., different numbers of subdomains in the two decomposed directions for a fixed total number of subdomains). We show that our proposed approach is faster than the existing attempts at 2D decomposition of 3D FFTs by Pekurovsky (2007) [23] (p3dfft), Takahashi (2009) [24], and Li and Laizet (2010) [25] (2decomp.org), especially for the case of large problem size and large number of processors (our strategy is 28% faster than Pekurovsky's scheme, its closest competitor). We also show theoretically that our scheme performs better than the approach by Nelson et al. (1993) [22] up to a certain number of processors, beyond which latency becomes an issue. We demonstrate that the speedup scales with the number of processors almost linearly before it saturates. The execution times on different processors differ by less than 5%, showing an excellent load balance. We further partition the execution time into computation, communication, and data copying related to the transpose operation, to understand how the relative percentage of the communication time increases with the number of processors. Finally, a theoretical complexity analysis is carried out to predict the scalability and its saturation. The complexity analysis indicates that the 2D domain decomposition will make it feasible to run a large 3D FFT on scalable computers with several hundred thousand processors.
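
As a rough illustration of the 2D (pencil) domain decomposition discussed in the abstract, the following Python sketch emulates a processor grid inside a single process. It is not the authors' code and it contains no message passing: the names `block_slices`, `fft3d_pencil`, `p_row` and `p_col` are hypothetical, and the data transposes that a distributed implementation would perform between stages are replaced here by re-slicing one shared array. The sketch only shows that three rounds of local 1D FFTs on pencils reproduce the full 3D transform.

```python
import numpy as np


def block_slices(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal slices."""
    edges = [k * n // parts for k in range(parts + 1)]
    return [slice(edges[k], edges[k + 1]) for k in range(parts)]


def fft3d_pencil(a, p_row, p_col):
    """Emulate a 3D FFT on a p_row x p_col grid of pencils (single process).

    Each (block, block) pair in the loops below stands for the data one
    rank would own. In a real distributed code the change of pencil
    orientation between stages is a transpose with communication; here
    it is implicit, because every "rank" reads the same array.
    """
    n0, n1, n2 = a.shape
    out = a.astype(complex)  # astype returns a copy, so `a` is untouched

    # Stage 1: pencils span all of axis 0; axes 1 and 2 are split.
    for s1 in block_slices(n1, p_row):
        for s2 in block_slices(n2, p_col):
            out[:, s1, s2] = np.fft.fft(out[:, s1, s2], axis=0)

    # Stage 2: pencils span all of axis 1; axes 0 and 2 are split.
    for s0 in block_slices(n0, p_row):
        for s2 in block_slices(n2, p_col):
            out[s0, :, s2] = np.fft.fft(out[s0, :, s2], axis=1)

    # Stage 3: pencils span all of axis 2; axes 0 and 1 are split.
    for s0 in block_slices(n0, p_row):
        for s1 in block_slices(n1, p_col):
            out[s0, s1, :] = np.fft.fft(out[s0, s1, :], axis=2)

    return out


rng = np.random.default_rng(0)
field = rng.standard_normal((8, 8, 8))
result = fft3d_pencil(field, p_row=2, p_col=4)
print(np.allclose(result, np.fft.fftn(field)))
```

For an N x N x N array, a 1D (slab) decomposition can use at most N ranks, whereas a grid like the one emulated above can use up to N x N, which is the reason a 2D decomposition can exploit far more processors. Different choices of `p_row` and `p_col` with the same product correspond to the subdomain configurations whose performance the abstract compares.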