We develop and implement an optimal broadcast algorithm for fully connected processor networks under a bidirectional communication model, in which each processor can simultaneously send a message to one processor and receive a message from another, possibly different, processor. For any number of processors $p$, the algorithm requires $N-1+\lceil\log p\rceil$ communication rounds to broadcast $N$ blocks of data from a root processor to the remaining processors, which meets the lower bound in the model. For data of size $m$, assuming that sending and receiving data of size $m'$ takes time $\alpha+\beta m'$, the best running time that can be achieved by dividing $m$ into equal-sized blocks is $\left(\sqrt{(\lceil\log p\rceil-1)\alpha}+\sqrt{\beta m}\right)^2$. The algorithm uses a regular, circulant graph communication pattern and degenerates into a binomial tree broadcast when only one block is broadcast. It is also well suited to fully connected clusters of SMP (Symmetric Multi-Processor) nodes. The algorithm is implemented as part of an MPI (Message Passing Interface) library. We demonstrate practical bandwidth improvements of up to a factor of 1.5 over several other commonly used broadcast algorithms on both a small SMP cluster and a 72-node NEC SX vector supercomputer.
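The closed-form running time follows from the round count. With $T(N)=(N-1+\lceil\log p\rceil)(\alpha+\beta m/N)$, expanding gives $N\alpha+(\lceil\log p\rceil-1)\beta m/N$ plus terms that do not depend on $N$. This is minimized at $N^*=\sqrt{(\lceil\log p\rceil-1)\beta m/\alpha}$, and substituting $N^*$ back yields $\left(\sqrt{(\lceil\log p\rceil-1)\alpha}+\sqrt{\beta m}\right)^2$. The Python sketch below checks this numerically. It is not the circulant-graph schedule from the paper. It only evaluates the stated round count under the linear cost model and compares the best integer block count with the real-valued optimum. The function names (`ceil_log2`, `rounds`, `model_time`, `closed_form_time`, `best_block_count`) and the sample parameters are hypothetical. It assumes logarithms are base 2, $p \ge 2$, and $\alpha > 0$, and it ignores the requirement that block sizes be integers.

```python
import math


def ceil_log2(p: int) -> int:
    """Exact ceil(log2 p) for p >= 1, via integer bit length (avoids float rounding)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return (p - 1).bit_length()


def rounds(p: int, n_blocks: int) -> int:
    """Round count stated in the abstract: N - 1 + ceil(log p)."""
    return n_blocks - 1 + ceil_log2(p)


def model_time(p: int, m: float, n_blocks: int, alpha: float, beta: float) -> float:
    """Linear model: every round moves one block of size m/N at cost alpha + beta*m/N."""
    return rounds(p, n_blocks) * (alpha + beta * m / n_blocks)


def closed_form_time(p: int, m: float, alpha: float, beta: float) -> float:
    """Optimum over real-valued N: (sqrt((ceil(log p) - 1)*alpha) + sqrt(beta*m))**2.
    Assumes p >= 2 so that ceil(log p) - 1 >= 0."""
    q = ceil_log2(p) - 1
    return (math.sqrt(q * alpha) + math.sqrt(beta * m)) ** 2


def best_block_count(p: int, m: float, alpha: float, beta: float) -> int:
    """Best integer N for p >= 2 and alpha > 0.

    T(N) = N*alpha + q*beta*m/N + const is convex in N, so the integer
    optimum is the floor or ceiling of the real optimum sqrt(q*beta*m/alpha).
    """
    q = ceil_log2(p) - 1
    n_star = math.sqrt(q * beta * m / alpha)
    # sorted() makes tie-breaking deterministic (smaller N wins on equal cost)
    candidates = sorted({max(1, math.floor(n_star)), max(1, math.ceil(n_star))})
    return min(candidates, key=lambda n: model_time(p, m, n, alpha, beta))


if __name__ == "__main__":
    # Hypothetical parameters chosen for illustration, not measurements from the paper.
    p, m, alpha, beta = 72, 1_000_000.0, 20.0, 0.001
    n = best_block_count(p, m, alpha, beta)
    print("blocks:", n, "rounds:", rounds(p, n))
    print("model time:", model_time(p, m, n, alpha, beta))
    print("closed-form bound:", closed_form_time(p, m, alpha, beta))
```

For $p=72$ and a single block, `rounds(72, 1)` returns 7. This equals $\lceil\log_2 72\rceil$, the round count of a binomial tree broadcast. Because $T(N)$ is convex in $N$, it is enough to check the floor and ceiling of $N^*$. The closed form is a lower bound that the best integer $N$ approaches but need not reach exactly.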