MPI_Allgather is an important collective operation used in applications such as matrix multiplication and basic linear algebra operations. With next-generation systems going multi-core, deployed clusters will run a high process count per node. Traditional implementations of Allgather use two separate channels, namely a network channel for communication across nodes and a shared-memory channel for intra-node communication. An important drawback of this approach is that communication buffers are not shared across the two channels, which results in extra copying of data within a node and sub-optimal performance. This is especially true for a collective involving a large number of processes with a high process density per node. In this paper, we propose a solution that eliminates the extra copy costs by sharing the communication buffers for both intra-node and inter-node communication. Further, we optimize performance by overlapping network operations with intra-node shared-memory copies. On a cluster of 32 2-way nodes, we observe an improvement of up to a factor of two for MPI_Allgather compared to the original implementation. We also observe overlap benefits of up to 43% for the 32x2 process configuration.
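To make the buffer-sharing idea concrete, below is a minimal user-level sketch in C using MPI-3 shared-memory windows. It is not the paper's implementation, which operates inside the MPI library; it is an illustration under stated assumptions: an equal number of processes per node and a node-contiguous mapping of world ranks to nodes. One per-node buffer, allocated with MPI_Win_allocate_shared, holds the entire Allgather result. Each process copies its block directly into this buffer, and node leaders then exchange whole node blocks over the network in place, out of and into that same buffer, so no second staging copy between a shared-memory channel and a network channel is needed.

/* allgather_shmem_sketch.c
 *
 * Hypothetical illustration, not the paper's implementation: one
 * shared-memory buffer per node holds the entire Allgather result and
 * is used directly for both the intra-node copies and the inter-node
 * exchange, avoiding the extra staging copy between the two channels.
 *
 * Assumes an equal number of processes per node and a node-contiguous
 * ("block") mapping of world ranks to nodes.
 *
 * Build/run (typical):  mpicc allgather_shmem_sketch.c && mpirun -np 8 ./a.out
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int wrank, wsize;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    /* Per-process contribution: COUNT ints. */
    enum { COUNT = 4 };
    int mydata[COUNT];
    for (int i = 0; i < COUNT; i++) mydata[i] = wrank * 100 + i;

    /* Intra-node communicator (MPI-3). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int nrank, nsize;
    MPI_Comm_rank(node_comm, &nrank);
    MPI_Comm_size(node_comm, &nsize);

    /* Leader communicator: one representative (node rank 0) per node. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, nrank == 0 ? 0 : MPI_UNDEFINED,
                   wrank, &leader_comm);

    /* One shared buffer per node, sized for the full result; node rank 0
     * allocates it, every other local rank attaches with size 0. */
    MPI_Win win;
    int *shared;
    MPI_Aint bytes = (nrank == 0) ? (MPI_Aint)wsize * COUNT * sizeof(int) : 0;
    MPI_Win_allocate_shared(bytes, sizeof(int), MPI_INFO_NULL,
                            node_comm, &shared, &win);
    if (nrank != 0) {
        MPI_Aint sz; int disp;
        MPI_Win_shared_query(win, 0, &sz, &disp, &shared);
    }

    /* Phase 1: each process copies its block straight into the shared
     * buffer at its world-rank offset -- the same buffer the network
     * exchange will use, so there is no second staging copy. */
    memcpy(shared + (size_t)wrank * COUNT, mydata, COUNT * sizeof(int));

    /* A strictly conforming version also brackets this with MPI_Win_sync. */
    MPI_Barrier(node_comm);

    /* Phase 2: node leaders exchange whole node blocks over the network,
     * in place, directly out of and into the shared buffer. */
    if (nrank == 0) {
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      shared, nsize * COUNT, MPI_INT, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    MPI_Barrier(node_comm);  /* full result now visible to all local ranks */

    if (wrank == 0)
        printf("first int of last block: %d\n", shared[(wsize - 1) * COUNT]);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

The overlap optimization described in the abstract would be layered on top of this scheme, for example by having the leader exchange node blocks with nonblocking operations while the remaining intra-node copies are still in flight; the sketch keeps the two phases sequential for clarity.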