Collectives are an important and frequently used component of MPI. Bucket algorithms, also known as "large vector" algorithms, were introduced in the early 1990s and have since become a well-known paradigm for large-message MPI collectives. Many modern supercomputers, such as the IBM Blue Gene and Cray XT, are based on torus interconnects, which offer a highly scalable interconnection architecture for distributed-memory systems. While near-optimal algorithms have been developed for torus interconnects in other paradigms, such as spanning trees, bucket algorithms have not been optimally extended to these networks. In this paper, we study the basic "divide, distribute and gather" MPI collectives for bucket algorithms -- Allgather, Reduce-scatter and Allreduce -- for large messages on torus interconnects. We present bucket-based algorithms for these collectives on bidirectional links. We show that these algorithms are optimal in terms of bandwidth and computation for symmetric torus networks (i.e., when all dimensions are equal), matching the theoretical lower bounds. For an asymmetric torus, our algorithms are asymptotically optimal and converge to the lower bound as the dimension sizes grow. We also argue that our bucket algorithms are more scalable on multicore nodes than spanning-tree algorithms. Previous studies of bucket algorithms on torus interconnects have focused on unidirectional links and have been unable to obtain tight lower bounds and optimal algorithms. We close this gap by providing stronger lower bounds and showing that our bidirectional algorithms can easily be adapted to the unidirectional case, matching our lower bounds in terms of bandwidth and computational complexity. We implement our algorithms on the IBM Blue Gene/P supercomputer, which has quad-core nodes connected in a 3-dimensional torus, using its low-level communication interface.
We demonstrate that our algorithms perform within 7--30% of the lower bounds for the different MPI collectives, that they scale well on multicore nodes, and that they achieve a factor of 3 to 17 speedup for various collectives over the latest optimized MPI implementation.
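To illustrate the bucket paradigm the abstract refers to, the following is a minimal simulation of the classic bucket (ring) Allgather on a one-dimensional unidirectional ring: the data is divided into p blocks, and in each of p-1 steps every process forwards the block it most recently received to its neighbor, so each link carries only (p-1)/p of the total data. This is a generic sketch of the paradigm, not the paper's torus-optimized bidirectional algorithms; the function name and data layout are hypothetical.

```python
def bucket_allgather(blocks):
    """Simulate bucket (ring) Allgather.

    blocks[i] is the block initially owned by process i.
    Returns each process's assembled buffer after p-1 ring steps.
    This is an illustrative sketch, not the paper's algorithm.
    """
    p = len(blocks)
    # Each process starts knowing only its own block.
    known = [{i: blocks[i]} for i in range(p)]
    for step in range(p - 1):
        # In step s, process i forwards block (i - s) mod p -- the block
        # it received in the previous step -- to process (i + 1) mod p.
        for src in range(p):
            blk = (src - step) % p
            known[(src + 1) % p][blk] = blocks[blk]
    # Assemble each buffer in block order.
    return [[buf[i] for i in range(p)] for buf in known]

result = bucket_allgather(["a", "b", "c", "d"])
# After p-1 = 3 steps, every process holds all four blocks.
assert all(r == ["a", "b", "c", "d"] for r in result)
```

Reduce-scatter runs the same ring schedule in reverse with a combine at each hop, and Allreduce composes the two, which is why the three collectives share the "divide, distribute and gather" structure.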