Designing and auto-tuning parallel 3-D FFT for computation-communication overlap

Authors:
Sukhyun Song;Jeffrey K. Hollingsworth
Affiliations:
University of Maryland, College Park, MD, USA;University of Maryland, College Park, MD, USA
Venue:
Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2014

Abstract

This paper presents a method to design and auto-tune a new parallel 3-D FFT code that uses the non-blocking MPI all-to-all operation. We achieve high performance by optimizing the overlap of computation and communication. Our code performs fully asynchronous communication without requiring support from special hardware, and we improve cache performance through loop tiling. Because these optimization techniques involve complex trade-offs, we parameterize our code and efficiently auto-tune the parameters over a large parameter space. Experimental results on two systems show that our code achieves a speedup of up to 1.76x over the FFTW library.
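
The overlap pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: a single-worker thread pool stands in for a non-blocking MPI all-to-all, and `compute`, `exchange`, `pipelined`, and `tune` are hypothetical names introduced only for this illustration. The data is split into chunks so that the computation on one chunk proceeds while the exchange of the previous chunk is still in flight. The chunk count is then treated as a tunable parameter and chosen by a brute-force timing loop, which is a simple stand-in and not necessarily the search strategy used in the paper.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compute(chunk):
    # Hypothetical stand-in for the local FFT work on one chunk.
    return [2 * x for x in chunk]


def exchange(chunk):
    # Hypothetical stand-in for a non-blocking all-to-all on one chunk.
    return list(reversed(chunk))


def pipelined(data, num_chunks):
    """Compute each chunk while the previous chunk's exchange is in flight."""
    size = max(1, (len(data) + num_chunks - 1) // num_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for chunk in chunks:
            computed = compute(chunk)  # runs while `pending` is in flight
            if pending is not None:
                results.append(pending.result())  # wait for the earlier exchange
            pending = pool.submit(exchange, computed)  # start the next exchange
        if pending is not None:
            results.append(pending.result())
    return results


def tune(data, candidates):
    """Brute-force tuning: time each chunk count and keep the fastest."""
    best, best_time = None, float("inf")
    for num_chunks in candidates:
        start = time.perf_counter()
        pipelined(data, num_chunks)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = num_chunks, elapsed
    return best
```

For example, `pipelined(list(range(6)), 3)` processes three chunks of two elements each, and `tune(data, [1, 2, 4])` returns whichever chunk count ran fastest on the current machine. In CPython the thread pool mainly illustrates the control flow; real overlap would come from a non-blocking communication library.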