An improved algorithm for (non-commutative) reduce-scatter with an application

  • Authors:
  • Jesper Larsson Träff

  • Affiliations:
  • C&C Research Laboratories, NEC Europe Ltd, Sankt Augustin, Germany

  • Venue:
  • PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2005

Abstract

The collective reduce-scatter operation in MPI performs an element-wise reduction using a given associative (and possibly commutative) binary operation of a sequence of m-element vectors, and distributes the result in $m_i$-sized blocks over the participating processors. For the case where the number of processors is a power of two, the binary operation is commutative, and all resulting blocks have the same size, efficient, butterfly-like algorithms are well-known and implemented in good MPI libraries. The contributions of this paper are threefold. First, we give a simple trick for extending the butterfly algorithm also to the case of non-commutative operations (which is advantageous also for the commutative case). Second, combining this with previous work, we give improved algorithms for the case where the number of processors is not a power of two. Third, we extend the algorithms also to the irregular case where the size of the resulting blocks may differ extremely. For p processors the algorithm requires $\lceil \log_2 p \rceil + (\lceil \log_2 p \rceil - \lfloor \log_2 p \rfloor)$ communication rounds for the regular case, which may double for the irregular case (depending on the amount of irregularity). For vectors of size m with $m = \sum_{i=0}^{p-1} m_i$ the total running time is $O(\log p + m)$, irrespective of whether the $m_i$ blocks are equal or not. The algorithm has been implemented, and on a small Myrinet cluster gives substantial improvements (up to a factor of 3 in the experiments reported) over other often used implementations. The reduce-scatter operation is a building block in the fence one-sided communication synchronization primitive, and for this application we also document worthwhile improvements over a previous implementation.
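
To make the operation's semantics and the stated round count concrete, the following is a minimal sequential sketch in Python. It is not the paper's butterfly algorithm and involves no actual communication: it only computes what a reduce-scatter is expected to return, applying the operation in processor (rank) order so that non-commutative operations give a well-defined result. The function names `reduce_scatter_reference` and `rounds_regular` are hypothetical and chosen for illustration only.

```python
from functools import reduce


def reduce_scatter_reference(vectors, block_sizes, op):
    """Sequential reference semantics of reduce-scatter (illustrative only).

    vectors[i] is the m-element input vector of processor i, and
    block_sizes[i] is m_i, the size of the result block processor i receives.
    The reduction is applied element-wise in rank order 0, 1, ..., p-1,
    which matters when op is associative but not commutative.
    """
    m = sum(block_sizes)
    if len(vectors) != len(block_sizes):
        raise ValueError("need one block size per processor")
    if any(len(v) != m for v in vectors):
        raise ValueError("every input vector must have m = sum(m_i) elements")

    # Element-wise reduction: column j holds element j of every input vector.
    reduced = [reduce(op, column) for column in zip(*vectors)]

    # Scatter: processor i receives the next m_i consecutive elements.
    blocks, start = [], 0
    for size in block_sizes:
        blocks.append(reduced[start:start + size])
        start += size
    return blocks


def rounds_regular(p):
    """Round count from the abstract for the regular case:
    ceil(log2 p) + (ceil(log2 p) - floor(log2 p)), in integer arithmetic."""
    if p < 1:
        raise ValueError("p must be positive")
    floor_log = p.bit_length() - 1
    ceil_log = (p - 1).bit_length()
    return ceil_log + (ceil_log - floor_log)


if __name__ == "__main__":
    # String concatenation is associative but not commutative.
    inputs = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
    print(reduce_scatter_reference(inputs, [1, 0, 2], lambda x, y: x + y))
    print([rounds_regular(p) for p in (4, 5, 6, 8)])
```

Under this formula, a power-of-two processor count needs exactly $\log_2 p$ rounds, while any other count pays one extra round on top of $\lceil \log_2 p \rceil$. The irregular block sizes in the example (1, 0, 2) correspond to the abstract's irregular case, where blocks may differ in size and some may be empty.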