Efficient implementation of reduce-scatter in MPI

Authors:
Massimo Bernaschi;Giulio Iannello;Mario Lauria
Affiliations:
Istituto Applicazioni del Calcolo, CNR, Viale del Policlinico 137, I-00161 Rome, Italy;Dipartimento di Informatica e Sistemistica, Università di Napoli, v. Claudio, 21-80125 Napoli, Italy;Department of Computer and Information Science, The Ohio State University, 2015 Neil Ave, Columbus OH
Venue:
Journal of Systems Architecture: the EUROMICRO Journal - Special issue: Parallel, distributed and network-based processing
Year:
2003

Abstract

We discuss the efficient implementation of a collective operation called reduce-scatter, which is defined in the MPI standard. The reduce-scatter is equivalent to the combination of a reduction on vectors of length n with a scatter of the resulting n-vector to all processors.

We describe the implementation issues and the performance characterization of two recently proposed algorithms for the reduce-scatter that have been proven to be highly efficient in theory under the assumption of a fully connected parallel system.

We present a performance comparison with existing mainstream implementations of the operation, which confirms the practical advantage of the new algorithms. Experiments show that the two algorithms have different characteristics, which make them complementary in providing a performance gain over standard algorithms.

Our study has been carried out on two different platforms: an SP2 and a Myrinet-interconnected cluster of Pentium Pro machines. However, most of the results reported here are not specific to either MPI or the platforms used, and they hold in general for any message-passing programming system.
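
To make the definition of reduce-scatter concrete, the following is a minimal sequential sketch of its semantics only: an element-wise reduction over the n-vectors contributed by each process, followed by a scatter of contiguous blocks of the result. It is not one of the algorithms evaluated in the paper and performs no communication. The function name `reduce_scatter` and the parameter `counts` (block sizes per process, in the spirit of the `recvcounts` argument of `MPI_Reduce_scatter`) are illustrative assumptions.

```python
from functools import reduce
from operator import add


def reduce_scatter(send_vectors, counts, op=add):
    """Simulate the semantics of reduce-scatter on a single machine.

    send_vectors[i] is the n-vector contributed by process i (hypothetical layout).
    counts[i] is the number of result elements that process i receives.
    Returns a list whose i-th entry is the block delivered to process i.
    """
    n = sum(counts)
    if len(counts) != len(send_vectors):
        raise ValueError("need exactly one count per process")
    if any(len(v) != n for v in send_vectors):
        raise ValueError("every input vector must have length sum(counts)")

    # Step 1: reduction. Combine the vectors element by element with op.
    reduced = [reduce(op, column) for column in zip(*send_vectors)]

    # Step 2: scatter. Hand out consecutive blocks of the reduced vector.
    blocks = []
    offset = 0
    for c in counts:
        blocks.append(reduced[offset:offset + c])
        offset += c
    return blocks


# Three simulated processes, each contributing a vector of length n = 6.
vectors = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
result = reduce_scatter(vectors, counts=[2, 2, 2])
print(result)  # [[111, 222], [333, 444], [555, 666]]
```

An efficient implementation avoids materializing the full reduced vector on any single process; this sketch shows only what each process must hold once the operation completes.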