On designing direct dependency-based fast recovery algorithms for distributed systems

Authors:
B. Gupta;Z. Liu;Z. Liang
Affiliations:
Southern Illinois University, Carbondale, IL;Southeast Missouri State University, Cape Girardeau, MO;Southern Illinois University, Carbondale, IL
Venue:
ACM SIGOPS Operating Systems Review
Year:
2004


Abstract

In this paper, we propose two recovery algorithms for distributed systems. Both algorithms follow a revolving centralized scheme. We show that direct dependency tracking of an integer representing the number of messages sent by each process is sufficient to determine the maximum consistent state. The main feature of the recovery algorithms is that all participating processes execute them simultaneously while determining the maximum consistent state, which ensures fast execution. The time overhead of the recovery algorithms is reduced further because both algorithms avoid some unnecessary comparisons while determining a consistent global checkpoint. The second algorithm is shown to be faster than the first because, in general, it avoids a much larger number of unnecessary comparisons; the trade-off is that the second algorithm must store more control information at each checkpoint.
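
To make the idea of finding a maximum consistent state from message counts concrete, here is a minimal sketch in Python. It is not either of the paper's algorithms: it is a plain sequential rollback-propagation loop, and it does not model the revolving centralized scheme, the simultaneous execution, or the comparison-skipping described above. The names `Checkpoint` and `max_consistent_state` are hypothetical. The sketch assumes FIFO channels, assumes each checkpoint records per-channel sent and received counts (the abstract mentions only a sent count), and assumes every process has an initial checkpoint with all counts at zero.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Checkpoint:
    # Assumed layout: sent[j] is the number of messages sent to process j
    # before this checkpoint; recv[j] is the number received from process j.
    sent: Dict[int, int] = field(default_factory=dict)
    recv: Dict[int, int] = field(default_factory=dict)


def max_consistent_state(history: List[List[Checkpoint]]) -> List[int]:
    """Return, for each process, the index of its checkpoint in the latest
    global checkpoint that contains no orphan messages.

    history[i] lists the checkpoints of process i, oldest first.
    history[i][0] must be the initial state with all counts zero, which
    guarantees that the rollback loop terminates.
    """
    n = len(history)
    line = [len(h) - 1 for h in history]  # start from the latest checkpoints
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                sent_by_j = history[j][line[j]].sent.get(i, 0)
                # An orphan exists if i has recorded more receives from j
                # than j has recorded sends to i; roll i back until it has not.
                while history[i][line[i]].recv.get(j, 0) > sent_by_j:
                    line[i] -= 1
                    changed = True
    return line


# Example: P0 checkpoints after sending one message to P1, then sends a
# second one. P1 checkpoints after receiving both, so its latest checkpoint
# records a message that P0's latest checkpoint has not sent.
example_history = [
    [Checkpoint(), Checkpoint(sent={1: 1})],  # process 0
    [Checkpoint(), Checkpoint(recv={0: 2})],  # process 1
]
print(max_consistent_state(example_history))  # prints [1, 0]
```

In the example, process 1 is rolled back to its initial checkpoint while process 0 keeps its latest one. Each pass of the outer loop compares every pair of processes again, which is the kind of repeated comparison the abstract says both algorithms are designed to reduce.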