Parallel prefix (scan) algorithms for MPI

Authors:
Peter Sanders; Jesper Larsson Träff
Affiliations:
Universität Karlsruhe, Karlsruhe, Germany; C&C Research Laboratories, NEC Europe Ltd., Sankt Augustin, Germany
Venue:
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Year:
2006

Abstract

We describe and experimentally compare four theoretically well-known algorithms for the parallel prefix operation (scan, in MPI terms), and give a presumably novel, doubly-pipelined implementation of the in-order binary tree parallel prefix algorithm. Bidirectional interconnects can benefit from this implementation. We present results from a 32-node AMD cluster with Myrinet 2000 and a 72-node SX-8 parallel vector system. The doubly-pipelined algorithm is more than a factor of two faster than the straightforward binomial-tree algorithm found in many MPI implementations. However, due to its small constant factors, the simple linear pipeline algorithm is preferable for systems with a moderate number of processors. We also discuss adapting the algorithms to clusters of SMP nodes.
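
For readers unfamiliar with the operation, the following is a minimal sketch of what an inclusive parallel prefix (scan) computes: rank i ends up with the combination of the inputs of ranks 0 through i. It is not taken from the paper and is not any of the algorithms the paper evaluates. It simulates the ranks as list positions in a single process, makes no MPI calls, and uses a generic round-based distance-doubling scheme chosen only for illustration. The function names `reference_scan` and `simulated_doubling_scan` are hypothetical.

```python
from itertools import accumulate
import operator


def reference_scan(values, op=operator.add):
    """Inclusive prefix: result[i] = values[0] op ... op values[i]."""
    return list(accumulate(values, op))


def simulated_doubling_scan(values, op=operator.add):
    """Simulate p ranks computing an inclusive scan in about log2(p) rounds.

    Assumption: list position i stands in for rank i. In each round,
    every rank i >= distance combines the partial result held by rank
    i - distance (left operand) with its own (right operand), so the
    operand order is preserved for non-commutative associative ops.
    """
    partial = list(values)
    p = len(partial)
    distance = 1
    while distance < p:
        # Snapshot models the messages all ranks send at the start of the round.
        previous = list(partial)
        for rank in range(distance, p):
            partial[rank] = op(previous[rank - distance], previous[rank])
        distance *= 2
    return partial


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(reference_scan(data))
    print(simulated_doubling_scan(data))
```

The operator only needs to be associative, which is why the sketch keeps the lower-ranked partial result as the left operand. String concatenation is a convenient non-commutative operator for checking this.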