An efficient protocol for checkpoint-based failure recovery in distributed systems

Authors:
D. Goswami;S. Sahu
Affiliations:
Indian Institute of Technology Guwahati, North Guwahati, India;Indian Institute of Technology Guwahati, North Guwahati, India
Venue:
ICDCIT'04 Proceedings of the First International Conference on Distributed Computing and Internet Technology
Year:
2004

Abstract

Synchronous checkpointing is an attractive approach because it simplifies failure recovery by storing a consistent global checkpoint. Efforts have been made to minimize the number of synchronizing messages and the number of checkpoints in such an approach. Taking a checkpoint without blocking the underlying computation is another important feature of the checkpointing process. In this paper, we present a synchronous checkpointing algorithm that forces a minimum number of nodes to take a checkpoint. The underlying computation needs to be blocked only partially, and only in rare cases. The algorithm tolerates the failure of an arbitrary number of nodes while checkpointing is in progress. Consistency of the checkpoint is ensured during the checkpointing process, so no time needs to be spent on ensuring consistency during recovery.
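
To make the notion of a consistent global checkpoint concrete, the following is a minimal sketch in Python. It is not the protocol presented in the paper; it only illustrates the standard consistency condition that a set of local checkpoints must not record a message as received unless it also records that message as sent (no orphan messages). The names `Message`, `orphan_messages`, and `is_consistent`, and the use of per-process event counters, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical message record used only for this illustration."""
    sender: str
    receiver: str
    send_time: int  # sender's local event counter at the send event
    recv_time: int  # receiver's local event counter at the receive event


def orphan_messages(checkpoint, messages):
    """Return messages whose receipt is recorded but whose send is not.

    `checkpoint` maps each process name to the local event counter at which
    its checkpoint was taken; events up to and including that counter are
    considered part of the saved state.
    """
    return [
        m for m in messages
        if m.recv_time <= checkpoint[m.receiver]
        and m.send_time > checkpoint[m.sender]
    ]


def is_consistent(checkpoint, messages):
    """A global checkpoint is consistent if it contains no orphan messages."""
    return not orphan_messages(checkpoint, messages)


# Two messages in a three-process run (illustrative values only).
messages = [
    Message("P1", "P2", send_time=3, recv_time=2),
    Message("P2", "P3", send_time=4, recv_time=1),
]

late_checkpoint = {"P1": 3, "P2": 4, "P3": 5}    # both sends are recorded
mixed_checkpoint = {"P1": 2, "P2": 2, "P3": 5}   # receipts recorded, sends not
```

With these values, `is_consistent(late_checkpoint, messages)` holds, while `mixed_checkpoint` records both messages as received but not as sent, so it is inconsistent. A synchronous protocol coordinates the nodes so that only checkpoint sets of the first kind are ever stored, which is why recovery can restart from the stored checkpoint directly.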