Optimizing the Synchronization Operations in Message Passing Interface One-Sided Communication

Authors:
Rajeev Thakur;William Gropp;Brian Toonen
Affiliations:
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA;Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA;Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2005

Abstract

One-sided communication in the Message Passing Interface (MPI) requires the use of one of three different synchronization mechanisms, which indicate when a one-sided operation can be started and when it has completed. Efficient implementation of the synchronization mechanisms is critical to achieving good performance with one-sided communication. However, our performance measurements indicate that in many MPI implementations, the synchronization functions add significant overhead, resulting in one-sided communication performing much worse than point-to-point communication for short and medium-sized messages. In this paper, we describe our efforts to minimize the overhead of synchronization in our implementation of one-sided communication in MPICH2. We describe our optimizations for all three synchronization mechanisms defined in MPI: fence, post-start-complete-wait, and lock-unlock. Our performance results demonstrate that, for short messages, MPICH2 performs six times faster than LAM for fence synchronization and 50% faster for post-start-complete-wait synchronization, and it performs more than twice as fast as Sun MPI for all three synchronization methods.
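
To make concrete what "when the operation can be started and when it is completed" means for fence synchronization, the following is a minimal toy model in Python, not MPI code and not the MPICH2 implementation. Threads stand in for MPI processes, and the names `ToyWindow`, `put`, `fence`, and `run_ring` are hypothetical, loosely mirroring `MPI_Put` and `MPI_Win_fence`. The sketch assumes one legal behavior: puts are queued at the origin and applied at the closing fence, so a target can rely on the data only after that fence returns. Post-start-complete-wait and lock-unlock are not modeled.

```python
import threading


class ToyWindow:
    """Toy stand-in for an MPI window: one local buffer per rank (thread)."""

    def __init__(self, nranks, size):
        self.buffers = [[0] * size for _ in range(nranks)]
        self._pending = [[] for _ in range(nranks)]  # puts queued per origin
        self._barrier = threading.Barrier(nranks)

    def put(self, origin, target, offset, value):
        # Only queue the put: it is not guaranteed complete until the next fence.
        self._pending[origin].append((target, offset, value))

    def fence(self, origin):
        # Complete this origin's queued puts ...
        for target, offset, value in self._pending[origin]:
            self.buffers[target][offset] = value
        self._pending[origin].clear()
        # ... then wait until every rank has done the same.
        self._barrier.wait()


def run_ring(nranks):
    """Each rank puts (rank + 100) into slot `rank` of its right neighbor."""
    win = ToyWindow(nranks, size=nranks)
    received = [None] * nranks

    def rank_main(rank):
        right = (rank + 1) % nranks
        left = (rank - 1) % nranks
        win.fence(rank)                       # opens the access epoch
        win.put(rank, right, rank, rank + 100)
        win.fence(rank)                       # closes it: all puts are complete
        received[rank] = win.buffers[rank][left]

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_ring(4))
```

In this model every `fence` call costs a full barrier even when a rank has nothing to send. That is the kind of synchronization overhead that matters most for short messages, where the cost of the data transfer itself is small.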