Collectives are an important and frequently used component of MPI. Bucket algorithms, also known as "large vector" algorithms, were introduced in the early 1990s and have since become a well-known paradigm for large-message MPI collectives. Many modern supercomputers, such as the IBM Blue Gene and Cray XT, are based on torus interconnects, which offer a highly scalable interconnection architecture for distributed-memory systems. While near-optimal algorithms have been developed for torus interconnects in other paradigms, such as spanning trees, bucket algorithms have not been optimally extended to these networks. In this paper, we study the basic "divide, distribute and gather" MPI collectives for bucket algorithms -- Allgather, Reduce-scatter and Allreduce -- for large messages on torus interconnects. We present bucket-based algorithms for these collectives on bidirectional links. We show that these algorithms are optimal in terms of bandwidth and computation for symmetric torus networks (i.e., when all dimensions are equal), matching the theoretical lower bounds. For an asymmetric torus, our algorithms are asymptotically optimal and converge to the lower bound as the dimension sizes grow. We also argue that our bucket algorithms are more scalable on multicore nodes than spanning-tree algorithms. Previous studies of bucket algorithms on torus interconnects have focused on unidirectional links and have been unable to obtain tight lower bounds and optimal algorithms. We close this gap by providing stronger lower bounds and showing that our bidirectional algorithms can easily be adapted to the unidirectional case, matching our lower bounds in terms of bandwidth and computational complexity. We implement our algorithms on the IBM Blue Gene/P supercomputer, which has quad-core nodes connected in a 3-dimensional torus, using its low-level communication interface.
We demonstrate that our algorithms perform within 7--30% of the lower bounds for the different MPI collectives, that they scale well on multicore nodes, and that they achieve a factor of 3 to 17 speedup for various collectives over the latest optimized MPI implementation.
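To illustrate the bucket paradigm the abstract refers to, the following is a minimal simulation of the classic bucket (ring) Allgather on a one-dimensional unidirectional ring: the data is divided into p blocks, and in each of p-1 steps every process forwards the block it most recently received to its neighbor, so each link carries only (p-1)/p of the total data. This is a generic sketch of the paradigm, not the paper's torus-optimized bidirectional algorithms; the function name and data layout are hypothetical.

```python
def bucket_allgather(blocks):
    """Simulate bucket (ring) Allgather.

    blocks[i] is the block initially owned by process i.
    Returns each process's assembled buffer after p-1 ring steps.
    This is an illustrative sketch, not the paper's algorithm.
    """
    p = len(blocks)
    # Each process starts knowing only its own block.
    known = [{i: blocks[i]} for i in range(p)]
    for step in range(p - 1):
        # In step s, process i forwards block (i - s) mod p -- the block
        # it received in the previous step -- to process (i + 1) mod p.
        for src in range(p):
            blk = (src - step) % p
            known[(src + 1) % p][blk] = blocks[blk]
    # Assemble each buffer in block order.
    return [[buf[i] for i in range(p)] for buf in known]

result = bucket_allgather(["a", "b", "c", "d"])
# After p-1 = 3 steps, every process holds all four blocks.
assert all(r == ["a", "b", "c", "d"] for r in result)
```

Reduce-scatter runs the same ring schedule in reverse with a combine at each hop, and Allreduce composes the two, which is why the three collectives share the "divide, distribute and gather" structure.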