VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes

Authors:
Troy Leblanc;Rakhi Anand;Edgar Gabriel;Jaspal Subhlok
Affiliations:
Department of Computer Science, University of Houston;Department of Computer Science, University of Houston;Department of Computer Science, University of Houston;Department of Computer Science, University of Houston
Venue:
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Year:
2009

Abstract

The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper introduces VolpexMPI, an MPI library designed to enable seamless forward application progress in the presence of frequent node failures as well as dynamically changing network speeds and node execution speeds. Process replication is employed to provide robustness in such volatile environments. The central challenge in the VolpexMPI design is to efficiently and automatically manage a dynamically varying number of process replicas in different states of execution progress. The key fault tolerance technique employed is fully distributed sender-based logging. The paper presents the design and a prototype implementation of VolpexMPI. Preliminary results validate that the overhead of providing robustness is modest for applications having a favorable ratio of communication to computation and a low degree of communication.
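
The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea behind sender-based logging with replicated receivers, not VolpexMPI's actual design or API. The class names `SenderLog` and `ReceiverReplica`, the per-(destination, tag) sequence numbers, and the pull-style `fetch` call are all assumptions made for illustration. The point it shows is that when each sender keeps its own log, replicas of the same rank at different stages of progress can each obtain the message they need without coordinating with one another.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical per-process message log (illustrative, not VolpexMPI's structure)."""

    def __init__(self):
        # (destination rank, tag) -> payloads in the order they were sent
        self._log = defaultdict(list)

    def send(self, dest_rank, tag, payload):
        """Record the message locally and return its sequence number."""
        messages = self._log[(dest_rank, tag)]
        messages.append(payload)
        return len(messages) - 1

    def fetch(self, dest_rank, tag, seq):
        """Serve a logged message to any replica of dest_rank.

        Returns None if the sender has not produced message number seq yet.
        """
        messages = self._log.get((dest_rank, tag), [])
        if seq < len(messages):
            return messages[seq]
        return None


class ReceiverReplica:
    """Hypothetical replica of a receiving rank that tracks its own progress."""

    def __init__(self, rank, name):
        self.rank = rank
        self.name = name
        # tag -> sequence number of the next message this replica needs
        self._next_seq = defaultdict(int)

    def recv(self, sender_log, tag):
        """Request the next message for this replica; None means not available yet."""
        seq = self._next_seq[tag]
        payload = sender_log.fetch(self.rank, tag, seq)
        if payload is not None:
            self._next_seq[tag] = seq + 1
        return payload


# Two replicas of rank 1 progress at different speeds against the same log.
log = SenderLog()
log.send(dest_rank=1, tag=0, payload="step-0")
log.send(dest_rank=1, tag=0, payload="step-1")

fast = ReceiverReplica(rank=1, name="fast")
slow = ReceiverReplica(rank=1, name="slow")

fast_first = fast.recv(log, tag=0)
fast_second = fast.recv(log, tag=0)
fast_third = fast.recv(log, tag=0)   # sender has not produced a third message
slow_first = slow.recv(log, tag=0)   # the slow replica still gets the first message
```

In this sketch the log grows without bound; any practical implementation would need a policy for discarding entries that no live replica can still request.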