Staggered Consistent Checkpointing
IEEE Transactions on Parallel and Distributed Systems
Communication-Induced Determination of Consistent Snapshots
IEEE Transactions on Parallel and Distributed Systems
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Asynchronous recovery without using vector timestamps
Journal of Parallel and Distributed Computing
Quantifying rollback propagation in distributed checkpointing
Journal of Parallel and Distributed Computing
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing
Using Consistent Global Checkpoints to Synchronize Processes in Distributed Simulation
DS-RT '05 Proceedings of the 9th IEEE International Symposium on Distributed Simulation and Real-Time Applications
Design and performance evaluation of enhanced two-level recovery scheme
PDCN '08 Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks
An efficient computing-checkpoint based coordinated checkpoint algorithm
EUC'06 Proceedings of the 2006 international conference on Embedded and Ubiquitous Computing
Using computing checkpoints implement consistent low-cost non-blocking coordinated checkpointing
PDCAT'04 Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies
Hi-index | 0.00 |
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following important problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design a communication-induced checkpointing protocol that directs processes to take additional local (forced) checkpoints to ensure that no local checkpoint is useless.A general and efficient protocol answering this problem is proposed. It is shown that several existing protocols that solve the same problem are particular instances of it. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in ``consistent global checkpoint''-based distributed applications. Detection of stable or unstable properties, rollback-recovery, and determination of distributed breakpoints are examples of such applications.