MPI Reduction Operations for Sparse Floating-point Data
Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
We describe simple, easy-to-implement MPI-library-internal functionality that enables MPI reduction operations to be performed more efficiently as the sparsity (the fraction of neutral elements for the given operator) of the input and intermediate result vectors increases. Using this functionality, we give an implementation of the MPI_Reduce collective operation that exploits the sparsity of both input and intermediate result vectors completely transparently to the application programmer. Experiments carried out on a 64-core Intel Nehalem multi-core cluster with InfiniBand interconnect show considerable and worthwhile improvements as the sparsity of the input grows: about a factor of three with 1% non-zero elements, which is close to the best possible for this approach. The overhead incurred for dense vectors is negligible compared to the same implementation without sparsity exploitation. For both very small and large vectors, the implemented SPS_Reduce function is faster than the native MPI_Reduce of the MPI library used, indicating that the reported improvements are not artifacts of suboptimal reduction algorithms.
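To make the idea concrete, the following is a minimal C sketch of a sparsity-exploiting reduction, not the paper's implementation: it assumes MPI_SUM on doubles (neutral element 0.0) and gathers packed (index, value) pairs linearly to the root instead of using a tree algorithm. The names SPS_Reduce_sum, sps_pair, and sps_pack are hypothetical illustrations.

/* Hypothetical sketch: each non-root rank packs the nonzero entries of
 * its vector as (index, value) pairs and sends them to the root, which
 * merges them into its own vector. A production implementation would
 * use a tree-based algorithm and fall back to dense payloads whenever
 * the packed form would be larger. */
#include <mpi.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int index; double value; } sps_pair;

/* Pack the nonzeros of v[0..n-1] into out; return the number packed. */
static int sps_pack(const double *v, int n, sps_pair *out)
{
    int k = 0;
    for (int i = 0; i < n; i++)
        if (v[i] != 0.0) { out[k].index = i; out[k].value = v[i]; k++; }
    return k;
}

/* Sum-reduce n doubles onto rank root of comm. Every rank passes its
 * contribution in result[]; on the root, result[] holds the sum on exit. */
int SPS_Reduce_sum(double *result, int n, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Derived datatype describing one (index, value) pair. */
    MPI_Datatype pair_type;
    int blens[2] = { 1, 1 };
    MPI_Aint displs[2] = { offsetof(sps_pair, index),
                           offsetof(sps_pair, value) };
    MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
    MPI_Type_create_struct(2, blens, displs, types, &pair_type);
    MPI_Type_commit(&pair_type);

    sps_pair *buf = malloc((size_t)n * sizeof *buf);

    if (rank != root) {
        /* Send only the nonzero entries; the message length carries
         * the count implicitly. */
        int k = sps_pack(result, n, buf);
        MPI_Send(buf, k, pair_type, root, 0, comm);
    } else {
        for (int r = 0; r < size; r++) {
            if (r == root) continue;
            MPI_Status st;
            int k;
            MPI_Recv(buf, n, pair_type, r, 0, comm, &st);
            MPI_Get_count(&st, pair_type, &k);
            for (int i = 0; i < k; i++)
                result[buf[i].index] += buf[i].value; /* merge nonzeros */
        }
    }

    free(buf);
    MPI_Type_free(&pair_type);
    return MPI_SUCCESS;
}

The packed representation pays off only while the nonzero count stays below roughly n * sizeof(double) / sizeof(sps_pair); the transparent behavior described in the abstract suggests switching per message between sparse and dense payloads, so that dense inputs incur essentially no extra cost.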