In this paper, we propose two recovery algorithms for distributed systems. Both algorithms follow a revolving centralized scheme. We show that direct dependency tracking of an integer representing the number of messages sent by each process is sufficient to determine the maximum consistent state. The main feature of the recovery algorithms is that all participating processes execute them simultaneously while determining the maximum consistent state, which ensures fast execution. The time overhead of the recovery algorithms is reduced further because both algorithms avoid some unnecessary comparisons while determining a consistent global checkpoint. The second algorithm is shown to be faster than the first because it avoids, in general, a much larger number of unnecessary comparisons; the trade-off is the increased amount of control information that the second algorithm must store at each checkpoint.
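To make the role of the message counters concrete, the following is a minimal sketch of how per-checkpoint send and receive counts can be used to find a latest orphan-free recovery line. It is not either of the two algorithms described in the abstract, whose details are not given here: it is a plain, sequential fixed-point rollback with none of their comparison-saving optimizations. The names `Checkpoint`, `sent`, `recv`, and `max_consistent_line` are hypothetical, and the sketch assumes FIFO channels and an initial checkpoint with zero counters at every process.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Checkpoint:
    # sent[j]: messages sent to process j before this checkpoint (assumed layout)
    sent: Dict[int, int] = field(default_factory=dict)
    # recv[j]: messages received from process j before this checkpoint
    recv: Dict[int, int] = field(default_factory=dict)


def max_consistent_line(history: List[List[Checkpoint]]) -> List[int]:
    """Return, per process, the checkpoint index of a latest orphan-free line.

    Assumes history[p][0] is an initial checkpoint with empty counters,
    so the rollback below can never run past index 0.
    """
    n = len(history)
    line = [len(h) - 1 for h in history]  # start from every latest checkpoint
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                sent_ij = history[i][line[i]].sent.get(j, 0)
                # Roll j back while it has recorded more receipts from i
                # than i has recorded sends to j (an orphan message).
                while history[j][line[j]].recv.get(i, 0) > sent_ij:
                    line[j] -= 1
                    changed = True
    return line


# P0 checkpointed after sending one message to P1, then sent a second one.
# P1 checkpointed after each receipt, so its last checkpoint holds an orphan.
history = [
    [Checkpoint(), Checkpoint(sent={1: 1})],
    [Checkpoint(), Checkpoint(recv={0: 1}), Checkpoint(recv={0: 2})],
]
print(max_consistent_line(history))  # [1, 1]
```

In this example P1 must discard its last checkpoint because it records two receipts from P0 while P0's latest checkpoint records only one send. Comparing every pair on every pass is the kind of repeated work that the abstract says both proposed algorithms partly avoid.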