In this paper, the concept of a “computing checkpoint” is introduced, and an efficient coordinated checkpointing algorithm is proposed. The algorithm combines two approaches to reducing the overhead of coordinated checkpointing: minimizing the number of processes that take checkpoints, and making the checkpointing process non-blocking. By piggybacking information about which processes have taken a new checkpoint on the broadcast commit message, the checkpoint sequence number of every process is kept consistent across all processes, so that unnecessary checkpoints and orphan messages are avoided in subsequent execution. Evaluation results show that the number of redundant computing checkpoints is less than 1/10 of the number of tentative checkpoints. Analysis and experiments show that the overhead of the algorithm is lower than that of other coordinated checkpointing algorithms.
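
The piggybacking step can be illustrated with a minimal sketch. It is not the paper's algorithm: it only shows how a broadcast commit message that carries the set of processes that took a new checkpoint lets every process keep the same view of all checkpoint sequence numbers. The names `CommitMessage`, `Process`, `csn`, and `broadcast_commit` are hypothetical, and the network is simulated by direct method calls.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CommitMessage:
    """Hypothetical commit message; `checkpointed` is the piggybacked set."""
    initiator: int
    checkpointed: frozenset  # ids of processes that took a new checkpoint


@dataclass
class Process:
    pid: int
    n: int                                   # total number of processes
    csn: list = field(default_factory=list)  # csn[j]: known sequence number of process j

    def __post_init__(self):
        if not self.csn:
            self.csn = [0] * self.n

    def on_commit(self, msg: CommitMessage) -> None:
        # Advance the local view for every process named in the piggybacked
        # set, including processes that took no checkpoint themselves.
        for j in msg.checkpointed:
            self.csn[j] += 1


def broadcast_commit(processes, initiator, checkpointed):
    """Simulated broadcast: deliver the same commit message to every process."""
    msg = CommitMessage(initiator, frozenset(checkpointed))
    for p in processes:
        p.on_commit(msg)


# Assumed scenario: four processes; only 0 and 2 checkpoint in the first
# round, and only 1 in the second round.
processes = [Process(pid=i, n=4) for i in range(4)]
broadcast_commit(processes, initiator=0, checkpointed={0, 2})
broadcast_commit(processes, initiator=1, checkpointed={1})
views = [p.csn for p in processes]
```

In this sketch every process ends with the same `csn` vector even though only a subset took checkpoints in each round. A consistent view of this kind is what lets a process decide whether a later message really requires a new checkpoint.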