The collective reduce-scatter operation in MPI performs an element-wise reduction of a sequence of $m$-element vectors using a given associative (and possibly commutative) binary operation, and distributes the result in blocks of size $m_i$ over the participating processors. For the case where the number of processors is a power of two, the binary operation is commutative, and all resulting blocks have the same size, efficient butterfly-like algorithms are well known and implemented in good MPI libraries. The contributions of this paper are threefold. First, we give a simple trick for extending the butterfly algorithm to non-commutative operations (which is also advantageous for the commutative case). Second, combining this with previous work, we give improved algorithms for the case where the number of processors is not a power of two. Third, we extend the algorithms to the irregular case where the sizes of the resulting blocks may differ greatly. For $p$ processors the algorithm requires $\lceil \log_2 p \rceil + (\lceil \log_2 p \rceil - \lfloor \log_2 p \rfloor)$ communication rounds for the regular case, which may double for the irregular case (depending on the amount of irregularity). For vectors of size $m$ with $m = \sum_{i=0}^{p-1} m_i$ the total running time is $O(\log p + m)$, irrespective of whether the $m_i$ blocks are equal or not. The algorithm has been implemented, and on a small Myrinet cluster gives substantial improvements (up to a factor of 3 in the experiments reported) over other commonly used implementations. The reduce-scatter operation is a building block in the fence one-sided communication synchronization primitive, and for this application we also document worthwhile improvements over a previous implementation.
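To make the semantics of the operation and the stated round count concrete, the following is a minimal sequential sketch in Python. It is not the butterfly algorithm of the paper and involves no message passing: it only computes what each processor should end up with, and evaluates the round-count formula for the regular case. The function names `reduce_scatter` and `regular_rounds` are hypothetical and chosen for illustration.

```python
from functools import reduce


def reduce_scatter(vectors, block_sizes, op):
    """Reference semantics only (hypothetical helper, not the paper's algorithm).

    vectors[r] is the m-element input vector of processor r.
    block_sizes[i] is m_i, the size of the result block for processor i.
    op is an associative binary operation; it need not be commutative.
    """
    m = sum(block_sizes)
    assert all(len(v) == m for v in vectors)
    # Reduce each element position in rank order 0, 1, ..., p-1,
    # so that non-commutative operations give the expected result.
    reduced = [reduce(op, column) for column in zip(*vectors)]
    # Cut the reduced vector into consecutive blocks of sizes m_0, ..., m_{p-1}.
    blocks, start = [], 0
    for size in block_sizes:
        blocks.append(reduced[start:start + size])
        start += size
    return blocks


def regular_rounds(p):
    """Round count from the abstract for the regular case:
    ceil(log2 p) + (ceil(log2 p) - floor(log2 p))."""
    ceil_log = (p - 1).bit_length()   # ceil(log2 p) for p >= 1
    floor_log = p.bit_length() - 1    # floor(log2 p) for p >= 1
    return ceil_log + (ceil_log - floor_log)
```

As a usage example, string concatenation is associative but not commutative, so it shows why rank order matters: with inputs `["a", "b", "c"]`, `["d", "e", "f"]`, `["g", "h", "i"]` and irregular block sizes `[1, 0, 2]`, processor 0 receives `["adg"]`, processor 1 receives an empty block, and processor 2 receives `["beh", "cfi"]`. For the round count, `regular_rounds(8)` is 3, while `regular_rounds(6)` is 4, reflecting the one extra round when $p$ is not a power of two.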