We describe and experimentally compare four theoretically well-known algorithms for the parallel prefix operation (scan, in MPI terms), and give a presumably novel, doubly-pipelined implementation of the in-order binary tree parallel prefix algorithm. Bidirectional interconnects can benefit from this implementation. We present results from a 32-node AMD cluster with Myrinet 2000 and a 72-node SX-8 parallel vector system. The doubly-pipelined algorithm is more than a factor of two faster than the straightforward binomial-tree algorithm found in many MPI implementations. However, because of its small constant factors, the simple linear pipeline algorithm is preferable for systems with a moderate number of processors. We also discuss adapting the algorithms to clusters of SMP nodes.
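To make the operation concrete, the following is a minimal sequential sketch of an inclusive scan computed with a linear pipeline, in Python. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function name `linear_pipeline_scan`, the `block_size` parameter, and the round-by-round simulation of ranks are all hypothetical, and message passing is modeled as reading the predecessor's already finished block. The result on rank r is the element-wise combination of the input vectors of ranks 0 through r, which matches the semantics of `MPI_Scan`.

```python
from operator import add


def linear_pipeline_scan(inputs, op=add, block_size=2):
    """Simulate an inclusive scan over p ranks using a linear pipeline.

    inputs[r] is the input vector of (hypothetical) rank r; all vectors
    must have the same length. Returns a list `result` in which result[r]
    is inputs[0] op inputs[1] op ... op inputs[r], applied element-wise.
    """
    p = len(inputs)
    if p == 0:
        return []
    m = len(inputs[0])
    # Split each vector into blocks so that several ranks can work at once.
    blocks = [(s, min(s + block_size, m)) for s in range(0, m, block_size)]
    result = [list(v) for v in inputs]

    # In round t, rank r (r >= 1) processes block t - r. Rank r - 1 finished
    # that same block in round t - 1, so its prefix is ready to be "received".
    rounds = len(blocks) + p - 1
    for t in range(rounds):
        for r in range(1, p):
            b = t - r
            if 0 <= b < len(blocks):
                lo, hi = blocks[b]
                for j in range(lo, hi):
                    # Lower ranks go on the left, so the operator only
                    # needs to be associative, not commutative.
                    result[r][j] = op(result[r - 1][j], result[r][j])
    return result


if __name__ == "__main__":
    vectors = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for rank, prefix in enumerate(linear_pipeline_scan(vectors)):
        print(rank, prefix)
```

In this model a vector split into b blocks takes b + p - 1 rounds on p ranks, so the pipeline start-up cost grows linearly with the number of ranks. This is consistent with the abstract's remark that the linear pipeline is attractive mainly for a moderate number of processors.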