Self-stabilizing algorithm for checkpointing in a distributed system

Authors:
Partha Sarathi Mandal;Krishnendu Mukhopadhyaya
Affiliations:
Advanced Computing and Microelectronics Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700 108, India;Advanced Computing and Microelectronics Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700 108, India
Venue:
Journal of Parallel and Distributed Computing
Year:
2007

Abstract

If the variables used by a checkpointing algorithm suffer data faults, existing checkpointing and recovery algorithms may fail. This paper proposes self-stabilizing algorithms for data fault detection and correction, checkpointing, and recovery on a ring topology. The data fault detection and correction algorithms tolerate at most one data fault per process, in any number of processes. The checkpointing algorithm handles multiple concurrent checkpoint initiations as well as data faults. Using the recovery algorithm, a process can recover from a fault even when multiple data faults are present in the system. All the proposed algorithms converge in O(n) steps, where n is the number of processes. The algorithms can also be extended to general topologies.