Necessary and Sufficient Conditions for Consistent Global Snapshots
IEEE Transactions on Parallel and Distributed Systems
Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems.
IEEE Transactions on Parallel and Distributed Systems
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints
IEEE Transactions on Computers
An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Quasi-Synchronous Checkpointing: Models, Characterization, and Classification
IEEE Transactions on Parallel and Distributed Systems
Fail-stop processors: an approach to designing fault-tolerant computing systems
ACM Transactions on Computer Systems (TOCS)
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
An Analysis of Communication-Induced Checkpointing
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
A VP-Accordant Checkpointing Protocol Preventing Useless Checkpoints
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
A low-overhead recovery technique using quasi-synchronous checkpointing
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
On Properties of RDT Communication-Induced Checkpointing Protocols
IEEE Transactions on Parallel and Distributed Systems
Quantifying rollback propagation in distributed checkpointing
Journal of Parallel and Distributed Computing
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing
Communication-based prevention of useless checkpoints in distributed computations
Distributed Computing
System structure for software fault tolerance
IEEE Transactions on Software Engineering
Hi-index | 0.89 |
In this paper, a checkpointing protocol based on loose synchronization is proposed. The protocol enables processes to take checkpoints at different frequencies so that each process can control its rollback distance. In traditional asynchronous and quasi-synchronous checkpointing protocols, the checkpoints that are not up-to-date may be used for recovery. As a result, the rollback distance is often difficult to control. In the proposed protocol, the checkpoint cycle of each process is dynamically adjusted using a pessimistic scheme so that strict 1-rollback is achieved; namely, one of the last two checkpoints of each process can be utilized for recovery.