Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
Optimizing threaded MPI execution on SMP clusters
ICS '01 Proceedings of the 15th international conference on Supercomputing
Optimal Broadcasting in Mesh-Connected Architectures
Optimal Broadcasting in Mesh-Connected Architectures
Overview of the IBM Blue Gene/P project
IBM Journal of Research and Development
HyperX: topology, routing, and packaging of efficient large-scale networks
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A Pipelined Algorithm for Large, Irregular All-Gather Problems
International Journal of High Performance Computing Applications
Optimal bucket algorithms for large MPI collectives on torus interconnects
Proceedings of the 24th ACM International Conference on Supercomputing
Fat-Trees Routing and Node Ordering Providing Contention Free Traffic for MPI Global Collectives
IPDPSW '11 Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
Global combine on mesh architectures with wormhole routing
IPPS '93 Proceedings of the 1993 Seventh International Parallel Processing Symposium
Versatile communication algorithms for data analysis
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
Bandwidth-optimal all-to-all exchanges in fat tree networks
Proceedings of the 27th international ACM conference on International conference on supercomputing
Hi-index | 0.00 |
Known algorithms for two important collective communication operations, allgather and reduce-scatter, are minimal-communication algorithms; no process sends or receives more than the minimum amount of data. This, combined with the data-ordering semantics of the operations, limits the flexibility and performance of these algorithms. Our novel non-minimal, topology-aware algorithms deliver far better performance with the addition of a very small amount of redundant communication. We develop novel algorithms for Clos networks and single or multi-ported torus networks. Tests on a 32k-node BlueGene/P result in allgather speedups of up to 6x and reduce-scatter speedups of over 11x compared to the native IBM algorithm. Broadcast, reduce, and allreduce can be composed of allgather or reduce-scatter and other collective operations; our techniques also improve the performance of these algorithms.