Popular algorithms proposed in the literature for Barrier and Allreduce on clusters, such as pair-wise exchange, dissemination, and gather-broadcast, do not perform optimally when there is skew among the nodes in the cluster. In pair-wise exchange and dissemination, every node must arrive before each step can complete. The gather-broadcast algorithm assumes a fixed tree topology. We propose using the hardware multicast of InfiniBand to design an adaptive algorithm that performs well in the presence of skew. In this approach, the tree topology is not fixed but adapts to the skew: if the skew is sufficiently large, the last-arriving node becomes the root of the tree. We have carried out an in-depth evaluation of our scheme, using synchronization delay as the performance metric for Barrier and Allreduce in the presence of skew. The evaluation shows that our design scales very well with system size. Our designs can reduce the synchronization delay by a factor of 2.28 for Barrier and by a factor of 2.18 for Allreduce. We have examined different skew scenarios and shown that the adaptive design performs better than or comparably to the existing schemes.
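
To make the intuition concrete, the following is a minimal toy model of synchronization delay (the time between the last arrival and the release of every node) for a Barrier. It is a sketch under stated assumptions, not the paper's implementation: the binary heap-shaped tree, the unit per-hop cost `hop`, the single-step multicast cost `mcast`, the default `threshold`, and the function names are all hypothetical choices made for illustration. In particular, the model simply assumes that once the skew exceeds the threshold, the late node already knows that every other node has arrived.

```python
def fixed_tree_delay(arrivals, hop=1.0):
    """Toy gather-broadcast over a fixed binary tree rooted at rank 0.

    arrivals[i] is the time rank i enters the barrier. Returns the
    synchronization delay: release time of the last node minus the
    time of the last arrival. Tree shape and hop cost are assumptions.
    """
    n = len(arrivals)
    ready = list(arrivals)
    # Gather phase: a parent is ready once it has arrived and has heard
    # from its children (children have higher ranks, so visit them first).
    for rank in range(n - 1, 0, -1):
        parent = (rank - 1) // 2
        ready[parent] = max(ready[parent], ready[rank] + hop)
    # Broadcast phase: the release walks back down, one hop per tree level.
    depth = n.bit_length() - 1
    release = ready[0] + depth * hop
    return release - max(arrivals)


def adaptive_delay(arrivals, hop=1.0, mcast=1.0, threshold=None):
    """Toy adaptive scheme: with large skew, the last arrival acts as root.

    Assumption: when the gap between the last and second-to-last arrival
    is at least `threshold`, the late node can release everyone with one
    multicast of cost `mcast`. Otherwise fall back to the fixed tree.
    """
    n = len(arrivals)
    if n < 2:
        return 0.0
    if threshold is None:
        # Hypothetical default: time for a full gather plus broadcast.
        threshold = 2 * (n.bit_length() - 1) * hop
    ordered = sorted(arrivals)
    skew = ordered[-1] - ordered[-2]
    if skew >= threshold:
        return mcast
    return fixed_tree_delay(arrivals, hop)


if __name__ == "__main__":
    skewed = [0] * 7 + [10]   # rank 7 (a leaf) arrives late
    print(fixed_tree_delay(skewed), adaptive_delay(skewed))
```

In this model the late leaf pays for a full walk up and down the fixed tree, while the adaptive variant pays for a single multicast step. With no skew the two coincide, since the adaptive variant falls back to the fixed tree. The numbers produced are properties of the toy cost model only and are not comparable to the measured factors reported above.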