Implementation and performance analysis of non-blocking collective operations for MPI

Authors:
Torsten Hoefler; Andrew Lumsdaine; Wolfgang Rehm
Affiliations:
Indiana University, Bloomington, IN; Indiana University, Bloomington, IN; Chemnitz University of Technology, Chemnitz, Germany
Venue:
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Year:
2007

Abstract

Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library for implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces a very low overhead between the application and the underlying MPI and thus, in conjunction with the potential to overlap communication with computation, offers the potential for optimizing real-world applications.
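The overlap of communication with computation mentioned in the abstract follows a start / compute / wait pattern: the collective is started, the caller does independent work, and only then waits for the result. The following is a minimal sketch of that pattern, emulated in plain Python with a background thread rather than real MPI. The function names (`emulated_allreduce`, `local_compute`, `run_overlapped`) and the sleep-based latency are illustrative assumptions and are not part of LibNBC or of MPI.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def emulated_allreduce(contributions, delay=0.05):
    """Hypothetical stand-in for a collective sum.

    Sleeps to mimic network latency, then returns the sum of the
    per-rank contributions. No real communication takes place.
    """
    time.sleep(delay)
    return sum(contributions)


def local_compute(n):
    """Independent work that does not depend on the collective's result."""
    return sum(i * i for i in range(n))


def run_overlapped(contributions, n):
    """Start the emulated collective, compute, then wait for completion."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # "Start": returns immediately with a handle (a Future here).
        handle = pool.submit(emulated_allreduce, contributions)
        # Overlapped computation proceeds while the collective is in flight.
        partial = local_compute(n)
        # "Wait": block until the collective has finished.
        total = handle.result()
    return total, partial


if __name__ == "__main__":
    total, partial = run_overlapped([1, 2, 3, 4], 1000)
    print(total, partial)
```

In a real MPI program the handle would be a request object returned by a non-blocking collective call, and how much of the communication time is actually hidden depends on the library and the network making progress in the background.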