An improved algorithm for (non-commutative) reduce-scatter with an application

Authors:
Jesper Larsson Träff
Affiliations:
C&C Research Laboratories, NEC Europe Ltd, Sankt Augustin, Germany
Venue:
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Year:
2005

Abstract

The collective reduce-scatter operation in MPI performs an element-wise reduction using a given associative (and possibly commutative) binary operation of a sequence of $m$-element vectors, and distributes the result in $m_i$-sized blocks over the participating processors. For the case where the number of processors is a power of two, the binary operation is commutative, and all resulting blocks have the same size, efficient, butterfly-like algorithms are well-known and implemented in good MPI libraries. The contributions of this paper are threefold. First, we give a simple trick for extending the butterfly algorithm also to the case of non-commutative operations (which is advantageous also for the commutative case). Second, combining this with previous work, we give improved algorithms for the case where the number of processors is not a power of two. Third, we extend the algorithms also to the irregular case where the size of the resulting blocks may differ extremely. For $p$ processors the algorithm requires $\lceil \log_2 p \rceil + (\lceil \log_2 p \rceil - \lfloor \log_2 p \rfloor)$ communication rounds for the regular case, which may double for the irregular case (depending on the amount of irregularity). For vectors of size $m$ with $m = \sum_{i=0}^{p-1} m_i$ the total running time is $O(\log p + m)$, irrespective of whether the $m_i$ blocks are equal or not. The algorithm has been implemented, and on a small Myrinet cluster gives substantial improvements (up to a factor of 3 in the experiments reported) over other often used implementations. The reduce-scatter operation is a building block in the fence one-sided communication synchronization primitive, and for this application we also document worthwhile improvements over a previous implementation.
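
To make the semantics described in the abstract concrete, the following is a minimal sequential sketch in Python. It is not the paper's butterfly algorithm and involves no communication; it only illustrates what result each processor should hold, and evaluates the round-count expression quoted above for the regular case. The function names `reduce_scatter_reference` and `regular_rounds` are hypothetical, chosen here for illustration, and the reduction is assumed to be applied in rank order 0, 1, ..., p-1 so that a non-commutative operation has a well-defined result.

```python
from functools import reduce


def reduce_scatter_reference(vectors, block_sizes, op):
    """Sequential reference for reduce-scatter semantics (illustrative only).

    vectors[r] is the m-element input vector of rank r, block_sizes[i] is m_i,
    and op is an associative binary operation. Returns one result block per rank.
    """
    p = len(vectors)
    m = sum(block_sizes)
    if len(block_sizes) != p:
        raise ValueError("need exactly one block size per rank")
    if any(len(v) != m for v in vectors):
        raise ValueError("every input vector must have m = sum(m_i) elements")

    # Element-wise reduction in rank order (assumption: order 0..p-1),
    # which matters when op is not commutative.
    reduced = [reduce(op, (v[j] for v in vectors)) for j in range(m)]

    # Rank i receives the i-th block, of size block_sizes[i] (may be zero).
    blocks, start = [], 0
    for size in block_sizes:
        blocks.append(reduced[start:start + size])
        start += size
    return blocks


def regular_rounds(p):
    """ceil(log2 p) + (ceil(log2 p) - floor(log2 p)), computed with integers."""
    if p < 1:
        raise ValueError("p must be at least 1")
    ceil_log = (p - 1).bit_length()
    floor_log = p.bit_length() - 1
    return ceil_log + (ceil_log - floor_log)


if __name__ == "__main__":
    # String concatenation is associative but not commutative.
    vecs = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
    # Irregular case: rank 1 receives an empty block.
    print(reduce_scatter_reference(vecs, [1, 0, 2], lambda x, y: x + y))
    print([regular_rounds(p) for p in (2, 5, 6, 8)])
```

Under these assumptions, the expression gives one extra round exactly when p is not a power of two (for example 4 rounds for p = 5 or p = 6, versus 3 rounds for p = 8). The additional rounds that the irregular case may need are not modelled here.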