Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
A Framework for Collective Personalized Communication
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-Based Clusters
ICPP '04 Proceedings of the 2004 International Conference on Parallel Processing
Efficient Barrier and Allreduce on InfiniBand Clusters Using Multicast and Adaptive Algorithms
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Efficient RDMA-Based Multi-Port Collectives on Multi-Rail QsNetII Clusters
IPDPS '06 Proceedings of the 20th International Conference on Parallel and Distributed Processing
Efficient Shared Memory and RDMA Based Design for MPI_Allgather over InfiniBand
EuroPVM/MPI '06 Proceedings of the 13th European PVM/MPI User's Group Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack
Euro-Par '06 Proceedings of the 12th International Conference on Parallel Processing
High-Performance RMA-Based Broadcast on the Intel SCC
Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures
A new parallel block aggregated algorithm for solving Markov chains
The Journal of Supercomputing
Euro-Par '12 Proceedings of the 18th International Conference on Parallel Processing
The all-to-all broadcast collective operation is essential for many parallel scientific applications; in the context of MPI, this collective is called MPI_Allgather. Contemporary MPI software stacks implement it on top of MPI point-to-point calls, which introduces several performance overheads. In this paper, we propose a design of all-to-all broadcast using the Remote Direct Memory Access (RDMA) feature offered by InfiniBand, an emerging high-performance interconnect. Our RDMA-based design eliminates the overheads associated with existing designs. Our results indicate that the latency of the all-to-all broadcast operation can be reduced by 30% for 32 processes and a message size of 32 KB. In addition, our design improves latency by a factor of 4.75 under no-buffer-reuse conditions for the same process count and message size. Further, our design improves the performance of a parallel matrix multiplication algorithm by 37% on eight processes when multiplying a 256×256 matrix.