Two-tree algorithms for full bandwidth broadcast, reduction and scan

Authors:
Peter Sanders; Jochen Speck; Jesper Larsson Träff
Affiliations:
Universität Karlsruhe, D-76128 Karlsruhe, Germany; Universität Karlsruhe, D-76128 Karlsruhe, Germany; NEC Laboratories Europe, NEC Europe Ltd., Rathausallee 10, D-53757 Sankt Augustin, Germany
Venue:
Parallel Computing
Year:
2009

Abstract

We present a new, simple algorithmic idea for the collective communication operations broadcast, reduction, and scan (prefix sums). The algorithms communicate concurrently over two binary trees, both of which span the entire network. By careful layout and communication scheduling, each tree communicates as efficiently as a single tree with exclusive use of the network. Our algorithms thus achieve up to twice the bandwidth of most previous algorithms. In particular, our approach beats all previous algorithms for reduction and scan. Experiments on clusters with Myrinet and InfiniBand interconnects show significant reductions in running time for all three operations, sometimes coming close to the best possible factor of two.
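
The idea of two binary trees that both span the network can be illustrated with a small sketch. The abstract does not specify the layout, so everything below is an assumption made for illustration: the function names `inorder_children` and `two_trees` are hypothetical, the number of tree nodes is restricted to 2^h - 1, and the second tree is obtained by cyclically shifting the in-order numbering of the first by one position. The sketch shows only one property that makes concurrent use of both trees plausible: a rank that is an inner node in one tree is a leaf in the other, so no rank has to forward data to children in both trees. It does not cover the communication schedule, and it is not necessarily the authors' exact construction.

```python
def inorder_children(pos):
    """Children of 1-based in-order position `pos` in a complete binary tree.

    With in-order numbering, the lowest set bit of `pos` equals 2**height,
    so odd positions are leaves and even positions are inner nodes.
    """
    low = pos & -pos          # lowest set bit of pos
    if low == 1:              # height 0: a leaf has no children
        return []
    half = low >> 1
    return [pos - half, pos + half]


def two_trees(h):
    """Build two spanning binary trees over ranks 0 .. 2**h - 2.

    Assumption for illustration: tree 1 maps position pos to rank pos - 1,
    tree 2 maps position pos to rank pos mod m (a cyclic shift by one).
    Returns two dicts mapping each rank to its list of child ranks.
    """
    m = 2 ** h - 1            # number of nodes in each tree
    t1, t2 = {}, {}
    for pos in range(1, m + 1):
        kids = inorder_children(pos)
        t1[pos - 1] = [k - 1 for k in kids]
        t2[pos % m] = [k % m for k in kids]
    return t1, t2


if __name__ == "__main__":
    t1, t2 = two_trees(3)
    inner1 = {r for r, kids in t1.items() if kids}
    inner2 = {r for r, kids in t2.items() if kids}
    print("tree 1:", t1)
    print("tree 2:", t2)
    print("inner nodes disjoint:", not (inner1 & inner2))
```

For h = 3 the seven ranks 0 to 6 appear in both trees. In tree 1 the root is rank 3 with children 1 and 5, and in tree 2 the root is rank 4 with children 2 and 6. The inner nodes are the odd ranks in tree 1 and the nonzero even ranks in tree 2, so the two sets do not overlap.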