Faster topology-aware collective algorithms through non-minimal communication

Authors: Paul Sack, William Gropp
Affiliation: University of Illinois at Urbana-Champaign, Urbana, IL, USA (both authors)
Venue: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12)
Year: 2012

Abstract

Known algorithms for two important collective communication operations, allgather and reduce-scatter, are minimal-communication algorithms: no process sends or receives more than the minimum amount of data. This constraint, combined with the data-ordering semantics of the operations, limits the flexibility and performance of these algorithms. Our novel non-minimal, topology-aware algorithms deliver far better performance at the cost of a very small amount of redundant communication. We develop such algorithms for Clos networks and for single- or multi-ported torus networks. Tests on a 32k-node BlueGene/P show allgather speedups of up to 6x and reduce-scatter speedups of over 11x compared to the native IBM algorithm. Broadcast, reduce, and allreduce can be composed of allgather or reduce-scatter and other collective operations, so our techniques also improve the performance of these operations.
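
To make "minimal communication" concrete, the sketch below simulates a textbook ring allgather. It is not one of the algorithms from this paper. With p processes holding one block each, every process must receive at least p - 1 blocks, and the ring schedule sends and receives exactly that many. The function name `ring_allgather` and the list-based process model are illustrative assumptions, not part of any MPI library.

```python
def ring_allgather(blocks):
    """Simulate a ring allgather; process i starts with blocks[i].

    Returns (buffers, sent, received): the final buffer of every
    process and the number of blocks each process sent and received.
    Hypothetical helper for illustration only.
    """
    p = len(blocks)
    buffers = [[None] * p for _ in range(p)]
    for rank in range(p):
        buffers[rank][rank] = blocks[rank]

    sent = [0] * p
    received = [0] * p

    for step in range(p - 1):
        # Collect all outgoing messages first so that the sends within
        # one step behave as if they happen simultaneously.
        outgoing = []
        for rank in range(p):
            # At step s, a process forwards the block it obtained at
            # step s - 1 (its own block when s == 0).
            idx = (rank - step) % p
            outgoing.append((idx, buffers[rank][idx]))

        for rank in range(p):
            right = (rank + 1) % p
            idx, data = outgoing[rank]
            buffers[right][idx] = data
            sent[rank] += 1
            received[right] += 1

    return buffers, sent, received


if __name__ == "__main__":
    buffers, sent, received = ring_allgather(["a", "b", "c", "d"])
    print(buffers[0], sent, received)
```

In this model every process ends with all p blocks in rank order after sending and receiving p - 1 blocks, with no redundant traffic. A non-minimal algorithm in the abstract's sense would allow some of these counts to exceed p - 1 in exchange for a schedule that better fits the network topology.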