VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper presents VolpexMPI, which is designed to enable seamless forward application progress in the presence of frequent node failures as well as dynamically changing networks and node execution speeds. Process replication is employed to provide robustness. The central challenge in the design of VolpexMPI is to efficiently and automatically manage a dynamically varying number of process replicas in different states of execution progress. The key fault-tolerance technique employed is fully distributed sender-based logging. The paper presents the design and an implementation of VolpexMPI. Preliminary results validate that the overhead of providing robustness is modest for applications with a favorable ratio of communication to computation and a low degree of communication.
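The interplay of process replication and sender-based logging mentioned above can be illustrated with a minimal Python sketch. This is not VolpexMPI's implementation: the class `SenderLog`, its methods, and the pull-style `fetch` call are hypothetical names chosen for illustration, and the abstract does not state how replicas actually retrieve logged messages. The sketch only shows the general idea that a sender retains its outgoing messages, so replicas of the receiver at different stages of progress can each obtain the message they need.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical per-process log of outgoing messages, kept by the sender."""

    def __init__(self):
        # (destination rank, tag) -> list of payloads in send order
        self._log = defaultdict(list)

    def record_send(self, dest, tag, payload):
        """Store the message locally and return its sequence number."""
        messages = self._log[(dest, tag)]
        messages.append(payload)
        return len(messages) - 1

    def fetch(self, dest, tag, seq):
        """Serve a logged message to any replica of rank `dest`.

        Returns None if the sender has not reached that send yet.
        """
        messages = self._log.get((dest, tag), [])
        return messages[seq] if seq < len(messages) else None


# Rank 0 sends two messages to rank 1; both are retained in its own log.
log = SenderLog()
log.record_send(dest=1, tag=7, payload="step-0")
log.record_send(dest=1, tag=7, payload="step-1")

# Two replicas of rank 1 at different points of execution read the same log.
fast_replica = log.fetch(dest=1, tag=7, seq=1)
slow_replica = log.fetch(dest=1, tag=7, seq=0)
not_yet_sent = log.fetch(dest=1, tag=7, seq=2)
```

Because the log lives with the sender, a slow or restarted replica can catch up without the sender re-executing, and no central message store is needed. The sketch omits log truncation and networking entirely.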