Asynchronous recovery without using vector timestamps

Authors:
D. Manivannan; M. Singhal
Affiliations:
Computer Science Department, University of Kentucky, 301 Rose Street, James F. Hardymon Building, Lexington, KY; Computer Science Department, University of Kentucky, 301 Rose Street, James F. Hardymon Building, Lexington, KY
Venue:
Journal of Parallel and Distributed Computing
Year:
2002

Abstract

A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful, together with an efficient asynchronous recovery algorithm based on it. The checkpointing algorithm allows processes to take checkpoints asynchronously, and forces them to take additional checkpoints where needed so that every checkpoint is useful. The recovery algorithm can handle concurrent failures of multiple processes and has no domino effect: a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are logged only selectively, to cope with the various message abnormalities that arise due to rollback, which results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps to track dependencies between checkpoints, which results in low message overhead during failure-free operation. Moreover, a process can asynchronously identify garbage checkpoints, that is, checkpoints no longer required for recovery, and delete them from stable storage.
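
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how a quasi-synchronous scheme can track checkpoint dependencies with a single integer rather than a vector timestamp. It assumes a simple index-based rule: every message carries the sender's checkpoint sequence number, and a receiver that is behind takes a forced checkpoint before processing the message. The class `Process`, its method names, and the rollback rule in `rollback_target` are hypothetical and chosen for this illustration; they are not taken from the paper. Selective message logging and garbage collection are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process in an index-based checkpointing sketch."""
    pid: int
    sn: int = 0  # sequence number of this process's latest checkpoint
    # Sequence numbers of checkpoints on stable storage; 0 is the initial state.
    checkpoints: list = field(default_factory=lambda: [0])

    def basic_checkpoint(self):
        """Checkpoint taken asynchronously, on the process's own schedule."""
        self.sn += 1
        self.checkpoints.append(self.sn)

    def send(self):
        """Return the single integer piggybacked on an outgoing message."""
        return self.sn

    def receive(self, msg_sn):
        """Take a forced checkpoint before processing if the sender is ahead."""
        if msg_sn > self.sn:
            self.sn = msg_sn
            self.checkpoints.append(self.sn)

    def rollback_target(self, k):
        """Earliest checkpoint with sequence number >= k.

        Returns None when no such checkpoint exists, in which case this
        process has not received any message sent after a checkpoint
        numbered k or higher and, under this sketch's assumptions,
        does not need to roll back.
        """
        candidates = [c for c in self.checkpoints if c >= k]
        return min(candidates) if candidates else None


# Usage: p0 checkpoints, then sends a message that forces p1 to checkpoint.
p0, p1, p2 = Process(0), Process(1), Process(2)
p0.basic_checkpoint()        # p0 now holds checkpoints 0 and 1
p1.receive(p0.send())        # forced checkpoint on p1 with sequence number 1
p1.basic_checkpoint()        # p1 continues independently to sequence number 2

# p0 fails and rolls back to its latest checkpoint; the others pick a
# matching checkpoint using only the integer k.
k = p0.checkpoints[-1]
targets = {p.pid: p.rollback_target(k) for p in (p0, p1, p2)}
print(targets)
```

In this sketch, a message sent after a checkpoint numbered `k` or higher carries a number of at least `k`, so its receiver already holds a checkpoint numbered at least `k` taken before the message was processed. That is why rolling each process back to its earliest such checkpoint leaves no message recorded as received but not yet sent.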