Preventing Useless Checkpoints in Distributed Computations

Authors:
Jean-Michel Helary; Achour Mostefaoui; Michel Raynal
Venue:
SRDS '97 Proceedings of the 16th Symposium on Reliable Distributed Systems
Year:
1997

Citing 0
Cited 10

Staggered Consistent Checkpointing
IEEE Transactions on Parallel and Distributed Systems

Communication-Induced Determination of Consistent Snapshots
IEEE Transactions on Parallel and Distributed Systems

A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)

Asynchronous recovery without using vector timestamps
Journal of Parallel and Distributed Computing

Quantifying rollback propagation in distributed checkpointing
Journal of Parallel and Distributed Computing

Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing

Using Consistent Global Checkpoints to Synchronize Processes in Distributed Simulation
DS-RT '05 Proceedings of the 9th IEEE International Symposium on Distributed Simulation and Real-Time Applications

Design and performance evaluation of enhanced two-level recovery scheme
PDCN '08 Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks

An efficient computing-checkpoint based coordinated checkpoint algorithm
EUC'06 Proceedings of the 2006 international conference on Embedded and Ubiquitous Computing

Using computing checkpoints implement consistent low-cost non-blocking coordinated checkpointing
PDCAT'04 Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies

Abstract

A useless checkpoint is a local checkpoint that cannot be part of any consistent global checkpoint. This paper addresses the following problem: given a set of processes that take (basic) local checkpoints independently and in an unknown way, design a communication-induced checkpointing protocol that directs processes to take additional (forced) local checkpoints so that no local checkpoint is useless. A general and efficient protocol that solves this problem is proposed, and several existing protocols that solve the same problem are shown to be particular instances of it. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in distributed applications based on consistent global checkpoints. Examples of such applications include detection of stable or unstable properties, rollback-recovery, and determination of distributed breakpoints.
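
To make the definition of a useless checkpoint concrete, the following minimal Python sketch checks by brute force whether a local checkpoint belongs to at least one consistent global checkpoint. It is an illustration only, not the protocol proposed in the paper. The event model (0-based event positions per process, messages as tuples), the function names `is_consistent` and `is_useless`, and the two-process scenario are all assumptions made for this example. The extra checkpoint on process 1 is placed by hand to show the effect a forced checkpoint can have.

```python
from itertools import product

# Assumed model: a message is (sender, send_pos, receiver, recv_pos), where
# positions are 0-based event indices on each process. A checkpoint at
# position p records the events 0..p-1 of its process.

def is_consistent(cut, messages):
    """Return True if no message is received inside the cut but sent outside it."""
    for sender, send_pos, receiver, recv_pos in messages:
        received = recv_pos < cut[receiver]
        sent = send_pos < cut[sender]
        if received and not sent:  # orphan message
            return False
    return True

def is_useless(proc, pos, checkpoints, messages):
    """Return True if checkpoint `pos` of `proc` is in no consistent global checkpoint."""
    others = [p for p in checkpoints if p != proc]
    for combo in product(*(checkpoints[p] for p in others)):
        cut = dict(zip(others, combo))
        cut[proc] = pos
        if is_consistent(cut, messages):
            return False
    return True

# Hypothetical scenario: process 1 sends m2 and then receives m1;
# process 0 receives m2, takes a checkpoint, and then sends m1.
messages = [
    (1, 0, 0, 0),  # m2: sent by process 1 (event 0), received by process 0 (event 0)
    (0, 1, 1, 1),  # m1: sent by process 0 (event 1), received by process 1 (event 1)
]

# Checkpoint positions per process (initial, intermediate, final).
basic = {0: [0, 1, 2], 1: [0, 2]}
# Same run with one extra checkpoint on process 1, taken before it receives m1.
forced = {0: [0, 1, 2], 1: [0, 1, 2]}

print(is_useless(0, 1, basic, messages))   # True
print(is_useless(0, 1, forced, messages))  # False
```

With only basic checkpoints, the checkpoint of process 0 at position 1 cannot be paired with any checkpoint of process 1 without creating an orphan message. Adding a checkpoint on process 1 before it receives m1 removes the problem, which is the kind of decision a communication-induced protocol has to make at run time using only information carried by messages.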