The Chandy-Lamport checkpointing algorithm is widely used in fault-tolerant implementations of MPI. However, it assumes that message passing is FIFO, a property the MPI standard does not guarantee at the application level. The algorithm therefore cannot serve as a basis for an implementation-independent fault-tolerant MPI. In this paper, we present a variant of the Chandy-Lamport algorithm that does not rely on the FIFO property. The variant can be implemented on top of MPI and hence used to develop a supplementary software component that adds fault tolerance to any MPI implementation compliant with the MPI standard. We prove the correctness of the algorithm and analyze its performance. Experimental results demonstrating the efficiency of the algorithm are also presented.
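The abstract does not describe how the variant works, so the sketch below is not the paper's algorithm. It illustrates, under stated assumptions, one well-known way to take a consistent snapshot without FIFO channels: message coloring in the style of Lai and Yang, where every message carries a flag saying whether it was sent before or after the sender's checkpoint. All names (`Process`, `Message`, `take_checkpoint`, `in_transit`) are hypothetical, the "network" is simulated by calling `receive` in a chosen order, and termination detection for late messages is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    amount: int
    red: bool  # piggybacked flag: True if sent after the sender's checkpoint


@dataclass
class Process:
    balance: int
    red: bool = False                  # becomes True once the checkpoint is taken
    checkpoint: Optional[int] = None   # saved local state
    in_transit: List[int] = field(default_factory=list)  # recorded channel state

    def take_checkpoint(self) -> None:
        # Save the local state once; later calls are no-ops.
        if not self.red:
            self.checkpoint = self.balance
            self.red = True

    def send(self, amount: int) -> Message:
        self.balance -= amount
        return Message(amount, self.red)

    def receive(self, msg: Message) -> None:
        if msg.red:
            # A post-checkpoint message must not influence a pre-checkpoint
            # state, so checkpoint first (no-op if already taken).
            self.take_checkpoint()
        elif self.red:
            # Sent before the sender's checkpoint, received after ours:
            # the message belongs to the channel state of the snapshot.
            self.in_transit.append(msg.amount)
        self.balance += msg.amount


# Hypothetical scenario: the network delivers m2 before m1 (non-FIFO).
p, q = Process(balance=100), Process(balance=50)
m1 = p.send(10)        # sent before p's checkpoint
p.take_checkpoint()    # p saves balance 90
m2 = p.send(20)        # sent after p's checkpoint, flagged red
q.receive(m2)          # overtakes m1; forces q to checkpoint at 50
q.receive(m1)          # late pre-checkpoint message, recorded as in transit

snapshot_total = p.checkpoint + q.checkpoint + sum(q.in_transit)
```

In this scenario the recorded global state (two checkpoints plus the in-transit message) conserves the initial total of 150 even though the second message overtook the first. A marker-based scheme relies on the marker staying ordered between the two messages, which is exactly the FIFO assumption the abstract says MPI does not guarantee at the application level.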