Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
A Framework for Collective Personalized Communication
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-Based Clusters
ICPP '04 Proceedings of the 2004 International Conference on Parallel Processing
Efficient Barrier and Allreduce on InfiniBand Clusters Using Multicast and Adaptive Algorithms
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Efficient RDMA-Based Multi-Port Collectives on Multi-Rail QsNetII Clusters
IPDPS '06 Proceedings of the 20th International Conference on Parallel and Distributed Processing
Efficient Shared Memory and RDMA Based Design for MPI_Allgather over InfiniBand
EuroPVM/MPI '06 Proceedings of the 13th European PVM/MPI User's Group Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack
Euro-Par '06 Proceedings of the 12th International Conference on Parallel Processing
High-Performance RMA-Based Broadcast on the Intel SCC
Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures
A new parallel block aggregated algorithm for solving Markov chains
The Journal of Supercomputing
Euro-Par '12 Proceedings of the 18th International Conference on Parallel Processing
The all-to-all broadcast collective operation is essential for many parallel scientific applications; in the context of MPI, this collective is called MPI_Allgather. Contemporary MPI software stacks implement it on top of MPI point-to-point calls, which introduces several performance overheads. In this paper, we propose a design of all-to-all broadcast using the Remote Direct Memory Access (RDMA) feature offered by InfiniBand, an emerging high-performance interconnect. Our RDMA-based design eliminates the overheads associated with existing designs. Our results indicate that the latency of the all-to-all broadcast operation can be reduced by 30% for 32 processes and a message size of 32 KB. In addition, our design improves latency by a factor of 4.75 under no-buffer-reuse conditions for the same process count and message size. Further, our design improves the performance of a parallel matrix multiplication algorithm by 37% on eight processes when multiplying a 256×256 matrix.