Journal of Parallel and Distributed Computing
Optimal computation of census functions in the postal model
Discrete Applied Mathematics
On the Design and Implementation of Broadcast and Global Combine Operations Using the Postal Model
IEEE Transactions on Parallel and Distributed Systems
Efficient Algorithms for the Reduce-Scatter Operation in LogGP
IEEE Transactions on Parallel and Distributed Systems
Computing Global Combine Operations in the Multiport Postal Model
IEEE Transactions on Parallel and Distributed Systems
Algorithms for Supporting Compiled Communication
IEEE Transactions on Parallel and Distributed Systems
Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Efficient Collective Operations Using Remote Memory Operations on VIA-Based Clusters
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
The Analysis and Optimization of Collective Communications on a Beowulf Cluster
ICPADS '02 Proceedings of the 9th International Conference on Parallel and Distributed Systems
Extending Collective Operations with Application Semantics for Improving Multi-Cluster Performance
ISPDC '04 Proceedings of the Third International Symposium on Parallel and Distributed Computing/Third International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks
Optimization of MPI collective communication on BlueGene/L systems
Proceedings of the 19th annual international conference on Supercomputing
An MPI prototype for compiled communication on Ethernet switched clusters
Journal of Parallel and Distributed Computing - Special issue: Design and performance of networks for super-, cluster-, and grid-computing: Part I
Efficient Barrier and Allreduce on Infiniband clusters using multicast and adaptive algorithms
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
An empirical study of reliable multicast protocols over Ethernet-connected networks
Performance Evaluation
A Message Scheduling Scheme for All-to-All Personalized Communication on Ethernet Switched Clusters
IEEE Transactions on Parallel and Distributed Systems
Techniques for pipelined broadcast on ethernet switched clusters
Journal of Parallel and Distributed Computing
Bandwidth efficient all-to-all broadcast on switched clusters
International Journal of Parallel Programming
A study of process arrival patterns for MPI collective operations
International Journal of Parallel Programming
Hi-index | 0.00 |
We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound of the amount of data that must be communicated in order to complete this operation and propose a ring-based algorithm that only requires tree connectivity to achieve bandwidth optimality. Unlike the widely used butterfly-like all-reduce algorithm that incurs network contention in SMP/multi-core clusters, the proposed algorithm can achieve contention-free communication in almost all contemporary clusters, including SMP/multi-core clusters and Ethernet switched clusters with multiple switches. We demonstrate that the proposed algorithm is more efficient than other algorithms on clusters with different nodal architectures and networking technologies when the data size is sufficiently large.