High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

Authors:
Krishna Kandalla;Hari Subramoni;Karen Tomko;Dmitry Pekurovsky;Sayantan Sur;Dhabaleswar K. Panda
Affiliations:
The Ohio State University, Columbus, USA;The Ohio State University, Columbus, USA;The Ohio Supercomputer Center, Columbus, USA;San Diego Supercomputer Center, San Diego, USA;The Ohio State University, Columbus, USA;The Ohio State University, Columbus, USA
Venue:
Computer Science - Research and Development
Year:
2011

Citing 5
Cited 3

Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
MPI Collectives on Modern Multicore Clusters: Performance Optimizations and Communication Characteristics

CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Design and Evaluation of Generalized Collective Communication Primitives with Overlap Using ConnectX-2 Offload Engine

HOTI '10 Proceedings of the 2010 18th IEEE Symposium on High Performance Interconnects
A case for non-blocking collective operations

ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking

Design and Implementation of Portable and Efficient Non-blocking Collective Communication

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Massively parallel direct numerical simulations of forced compressible turbulence: a hybrid MPI/OpenMP approach

Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Designing and auto-tuning parallel 3-D FFT for computation-communication overlap

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

Three-dimensional FFT is an important component of many scientific computing applications ranging from fluid dynamics, to astrophysics and molecular dynamics. P3DFFT is a widely used three-dimensional FFT package. It uses the Message Passing Interface (MPI) programming model. The performance and scalability of parallel 3D FFT is limited by the time spent in the Alltoall Personalized exchange (MPI_Alltoall) operations. Hiding the latency of the MPI_Alltoall operation is critical towards scaling P3DFFT. The newest revision of MPI, MPI-3, is widely expected to provide support for non-blocking collective communication to enable latency-hiding. The latest InfiniBand adapter from Mellanox, ConnectX-2, enables offloading of generalized lists of communication operations to the network interface. Such an interface can be leveraged to design non-blocking collective operations. In this paper, we design a scalable, non-blocking Alltoall Personalized Exchange algorithm based on the network offload technology. To the best of our knowledge, this is the first paper to propose high performance non-blocking algorithms for dense collective operations, by leveraging InfiniBand's network offload features. We also re-design the P3DFFT library and a sample application kernel to overlap the Alltoall operations with application-level computation. We are able to scale our implementation of the non-blocking Alltoall operation to more than 512 processes and we achieve near perfect computation/communication overlap (99%). We also see an improvement of about 23% in the overall run-time of our modified P3DFFT when compared to the default-blocking version and an improvement of about 17% when compared to the host-based non-blocking Alltoall schemes.