Optimal broadcast for fully connected processor-node networks

Authors:
Jesper Larsson Träff;Andreas Ripke
Affiliations:
NEC Laboratories Europe, NEC Europe Ltd., Rathausallee 10, D-53757 Sankt Augustin, Germany;NEC Laboratories Europe, NEC Europe Ltd., Rathausallee 10, D-53757 Sankt Augustin, Germany
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

We develop and implement an optimal broadcast algorithm for fully connected processor networks under a bidirectional communication model in which each processor can simultaneously send a message to one processor and receive a message from another, possibly different, processor. For any number of processors $p$, the algorithm requires $N-1+\lceil\log p\rceil$ communication rounds to broadcast $N$ blocks of data from a root processor to the remaining processors, meeting the lower bound in the model. For data of size $m$, assuming that sending and receiving data of size $m'$ takes time $\alpha+\beta m'$, the best running time that can be achieved by the division of $m$ into equal-sized blocks is $\left(\sqrt{(\lceil\log p\rceil-1)\alpha}+\sqrt{\beta m}\right)^2$. The algorithm uses a regular, circulant graph communication pattern, and degenerates into a binomial tree broadcast when the number of blocks to be broadcast is one. The algorithm is furthermore well suited to fully connected clusters of SMP (Symmetric Multi-Processor) nodes. The algorithm is implemented as part of an MPI (Message Passing Interface) library. We demonstrate significant practical bandwidth improvements of up to a factor of 1.5 over several other, commonly used broadcast algorithms on both a small SMP cluster and a 72-node NEC SX vector supercomputer.
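The cost model in the abstract can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the logarithm is base 2, that each of the rounds moves one block of size m/N and therefore costs alpha + beta * m / N, and that p >= 3 so the closed-form expression is meaningful. All function names and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
import math


def ceil_log2(p: int) -> int:
    """Integer ceil(log2(p)) for p >= 1 (base 2 is an assumption)."""
    return (p - 1).bit_length()


def rounds(p: int, n_blocks: int) -> int:
    """Round count stated in the abstract: N - 1 + ceil(log p)."""
    return n_blocks - 1 + ceil_log2(p)


def broadcast_time(p: int, m: float, n_blocks: int,
                   alpha: float, beta: float) -> float:
    """Modelled time when m is split into n_blocks equal blocks.

    Assumes every round costs alpha + beta * (m / n_blocks).
    """
    return rounds(p, n_blocks) * (alpha + beta * m / n_blocks)


def closed_form_time(p: int, m: float, alpha: float, beta: float) -> float:
    """(sqrt((ceil(log p) - 1) * alpha) + sqrt(beta * m))^2, for p >= 3."""
    q = ceil_log2(p)
    return (math.sqrt((q - 1) * alpha) + math.sqrt(beta * m)) ** 2


def best_block_count(p: int, m: float, alpha: float, beta: float) -> int:
    """Integer block count minimizing broadcast_time.

    The continuous minimizer is N* = sqrt((q - 1) * beta * m / alpha);
    the model is convex in N, so only the two neighbouring integers
    need to be compared.
    """
    q = ceil_log2(p)
    if q <= 1:
        return 1  # with at most two processors, splitting only adds latency
    n_star = math.sqrt((q - 1) * beta * m / alpha)
    candidates = {max(1, math.floor(n_star)), max(1, math.ceil(n_star))}
    return min(candidates,
               key=lambda n: broadcast_time(p, m, n, alpha, beta))
```

As a usage example with made-up parameters, `best_block_count(72, 1e6, 1e-5, 1e-9)` picks a block count next to the continuous optimum, and `broadcast_time` at that count stays within a fraction of a percent of `closed_form_time(72, 1e6, 1e-5, 1e-9)`. The small gap arises only because the block count must be an integer. With a single block, `rounds(p, 1)` reduces to ceil(log p), matching the binomial tree case mentioned in the abstract.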