This paper presents recent developments in fault management for P2P-MPI, a grid middleware, covering both fault tolerance for applications and fault detection. P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is an analysis of the advantages and drawbacks of such detectors in a real implementation, and their integration into P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the speed of failure detection. We propose a variant of the binary round-robin protocol that is, in every case, more reliable than the application execution itself. Experiments on applications of up to 256 processes, carried out on Grid'5000, show that the measured detection times closely match the predictions.