Efficient SMP-aware MPI-level broadcast over InfiniBand's hardware multicast

Authors:
Amith R. Mamidala;Lei Chai;Hyun-Wook Jin;Dhabaleswar K. Panda
Affiliations:
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Computer Science and Engineering, The Ohio State University, Columbus, OH
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Abstract

Most high-end computing clusters today feature multi-way SMP nodes interconnected by an ultra-low-latency, high-bandwidth network. InfiniBand is emerging as a high-speed network for such systems. It provides a scalable and efficient hardware multicast primitive with which many MPI collective operations can be implemented efficiently. However, employing hardware multicast as the communication method may not perform well in all cases, especially when more than one process runs per node. In this context, the shared memory channel becomes the desired communication medium within the node, as it delivers latencies an order of magnitude lower than inter-node message latencies. Thus, to deliver optimal collective performance, coupling hardware multicast with the shared memory channel becomes necessary. In this paper we propose mechanisms to address this issue. On a 16-node 2-way SMP cluster, the Leader-based scheme proposed in this paper improves the performance of the MPI_Bcast operation by factors of up to 2.3 and 1.8 compared to the point-to-point solution and the original solution employing only hardware multicast, respectively. We have also evaluated our designs on a NUMA-based system and obtained a performance improvement by a factor of 1.7 on a 2-node 4-way system. We also propose a Dynamic Attach Policy as an enhancement to this scheme to mitigate the impact of process skew on the performance of the collective operation.
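
The abstract does not spell out how the Leader-based scheme works, so the following is only an illustrative sketch of one plausible reading: one process per node (a "leader") receives the message over the network, and the remaining processes on that node obtain it through a shared-memory copy. It is a pure-Python simulation that counts messages, not an MPI or InfiniBand implementation. The function name leader_based_bcast, the block mapping of ranks to nodes, and the choice of the lowest rank as leader are all assumptions made for illustration.

```python
from collections import defaultdict


def leader_based_bcast(num_nodes, procs_per_node, root, payload):
    """Simulate a two-level broadcast and return (buffers, stats).

    Hypothetical model, not the paper's implementation:
    stage 1 delivers the payload to one leader per node over the
    network (standing in for hardware multicast); stage 2 copies it
    to the other local ranks (standing in for shared memory).
    """
    num_ranks = num_nodes * procs_per_node

    # Assumption: ranks are block-mapped, so node = rank // procs_per_node.
    members = defaultdict(list)
    for rank in range(num_ranks):
        members[rank // procs_per_node].append(rank)
    root_node = root // procs_per_node

    # Assumption: the leader is the lowest rank on each node, except on
    # the root's node, where the root itself acts as leader.
    leaders = {
        node: (root if node == root_node else ranks[0])
        for node, ranks in members.items()
    }

    buffers = {root: payload}
    stats = {"network_deliveries": 0, "shm_copies": 0}

    # Stage 1: inter-node delivery to every leader other than the root.
    for leader in leaders.values():
        if leader != root:
            buffers[leader] = payload
            stats["network_deliveries"] += 1

    # Stage 2: intra-node copy from each leader to its local peers.
    for node, ranks in members.items():
        for rank in ranks:
            if rank != leaders[node]:
                buffers[rank] = buffers[leaders[node]]
                stats["shm_copies"] += 1

    return buffers, stats


if __name__ == "__main__":
    # Same shape as the 16-node 2-way SMP cluster mentioned in the abstract.
    _, stats = leader_based_bcast(16, 2, root=0, payload=b"data")
    print(stats)
```

Under these assumptions, only one process per node is involved in network traffic, and every other delivery is an intra-node copy. That is the kind of split the abstract motivates when it notes that shared-memory latencies are an order of magnitude lower than inter-node latencies. The sketch does not model timing, process skew, or the Dynamic Attach Policy.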