The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper introduces VolpexMPI, which is designed to enable seamless forward application progress in the presence of frequent node failures as well as dynamically changing network speeds and node execution speeds. Process replication is employed to provide robustness in such volatile environments. The central challenge in the VolpexMPI design is to efficiently and automatically manage a dynamically varying number of process replicas in different states of execution progress. The key fault tolerance technique employed is fully distributed sender-based logging. The paper presents the design and a prototype implementation of VolpexMPI. Preliminary results indicate that the overhead of providing robustness is modest for applications having a favorable ratio of communication to computation and a low degree of communication.
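To make the idea of sender-based logging with replicated receivers more concrete, the following is a minimal single-process sketch in Python. It is not VolpexMPI code: the class names `SenderLog` and `ReplicaReceiver`, the in-memory dictionary log, and the scheme in which a receiver asks the sender for a message by sequence number are all assumptions made for illustration. It only shows why a log kept at the sender lets replicas at different stages of progress obtain the same message stream.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical in-memory message log kept by one sending process."""

    def __init__(self):
        # (dest_rank, tag) -> payloads in the order they were sent
        self._log = defaultdict(list)

    def record_send(self, dest_rank, tag, payload):
        """Store the payload locally and return its sequence number."""
        entries = self._log[(dest_rank, tag)]
        entries.append(payload)
        return len(entries) - 1

    def fetch(self, dest_rank, tag, seq):
        """Return the logged payload, or None if it has not been sent yet."""
        entries = self._log.get((dest_rank, tag), [])
        return entries[seq] if 0 <= seq < len(entries) else None


class ReplicaReceiver:
    """Hypothetical replica of a receiving rank; tracks its own progress."""

    def __init__(self, rank):
        self.rank = rank
        self._next_seq = defaultdict(int)  # tag -> next sequence number wanted

    def receive(self, sender_log, tag):
        """Ask the sender's log for the next message this replica needs."""
        seq = self._next_seq[tag]
        payload = sender_log.fetch(self.rank, tag, seq)
        if payload is not None:
            self._next_seq[tag] = seq + 1
        return payload


# Assumed scenario: one sender, two replicas of rank 1 at different speeds.
log = SenderLog()
log.record_send(dest_rank=1, tag=0, payload="step-0")
log.record_send(dest_rank=1, tag=0, payload="step-1")

fast = ReplicaReceiver(rank=1)
slow = ReplicaReceiver(rank=1)

fast_seen = [fast.receive(log, 0), fast.receive(log, 0)]
slow_seen = [slow.receive(log, 0)]      # the slow replica is one message behind
fast_ahead = fast.receive(log, 0)       # nothing newer has been logged yet
slow_seen.append(slow.receive(log, 0))  # the slow replica catches up later
```

Because the log lives with the sender, a slow or restarted replica can replay messages without the sender re-executing, at the cost of the memory the sender spends holding past messages. A real implementation would also need to bound or prune that log, which this sketch omits.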