Communication target selection for replicated MPI processes

Authors:
Rakhi Anand;Edgar Gabriel;Jaspal Subhlok
Affiliations:
Department of Computer Science, University of Houston;Department of Computer Science, University of Houston;Department of Computer Science, University of Houston
Venue:
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Year:
2010

Abstract

VolpexMPI is an MPI library designed for volunteer computing environments. To cope with the fundamental unreliability of these environments, VolpexMPI deploys two or more replicas of each MPI process. A receiver-driven communication scheme eliminates redundant message exchanges, and sender-based logging ensures seamless application progress despite varying processor execution speeds and routine failures. In this model, executing a receive operation requires deciding which replica of the sending process to contact first. Contacting the fastest replica appears to be the optimal local decision, but it can be globally suboptimal because it may slow down the fastest replica. Further, identifying the fastest replica during execution is a challenge in itself. This paper evaluates various target selection algorithms that manage these trade-offs with the objective of minimizing overall execution time. The algorithms are evaluated using the NAS Parallel Benchmarks on heterogeneous network configurations, heterogeneous processor configurations, and a combination of both.
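To make the target selection problem concrete, the following is a minimal sketch of one naive receiver-side policy: contact the replica with the lowest observed mean response time, and try replicas with no measurements first. It is not one of the algorithms evaluated in the paper and does not use the VolpexMPI API; the names `ReplicaStats`, `TargetSelector`, `record`, and `order` are hypothetical and exist only for illustration.

```python
class ReplicaStats:
    """Running estimate of how quickly one sender replica answers requests.

    Hypothetical helper for illustration; not part of VolpexMPI.
    """

    def __init__(self):
        self.samples = 0
        self.mean_latency = 0.0

    def record(self, latency):
        # Incremental mean, so individual samples need not be stored.
        self.samples += 1
        self.mean_latency += (latency - self.mean_latency) / self.samples


class TargetSelector:
    """Orders the replicas of one sending process for a receive operation."""

    def __init__(self, replica_ids):
        self.replica_ids = list(replica_ids)
        self.stats = {r: ReplicaStats() for r in self.replica_ids}

    def record(self, replica_id, latency):
        # Called by the receiver after a replica has answered a request.
        self.stats[replica_id].record(latency)

    def order(self):
        # Unmeasured replicas sort first (False < True), then measured
        # replicas by ascending mean latency. sorted() is stable, so ties
        # keep the original replica order.
        return sorted(
            self.replica_ids,
            key=lambda r: (self.stats[r].samples > 0, self.stats[r].mean_latency),
        )


if __name__ == "__main__":
    selector = TargetSelector(["A0", "A1"])
    selector.record("A0", 5.0)
    selector.record("A1", 2.0)
    print(selector.order())
```

A purely greedy policy like this one illustrates the trade-off named in the abstract: every receiver that measures the same replica as fastest will direct its requests there, which is exactly the situation that can slow that replica down.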