Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Efficient distributed recovery using message logging
Proceedings of the eighth annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using optimistic message logging and check-pointing
Journal of Algorithms
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems
IEEE Transactions on Parallel and Distributed Systems
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints
IEEE Transactions on Computers
On Coordinated Checkpointing in Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Staggered Consistent Checkpointing
IEEE Transactions on Parallel and Distributed Systems
Quasi-Synchronous Checkpointing: Models, Characterization, and Classification
IEEE Transactions on Parallel and Distributed Systems
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
An Efficient Protocol for Checkpointing Recovery in Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Efficient Algorithms for Crash Recovery in Distributed Systems
Proceedings of the Tenth Conference on Foundations of Software Technology and Theoretical Computer Science
A low-overhead recovery technique using quasi-synchronous checkpointing
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
Parallel Checkpoint/Restart without Message Logging
ICPP '00 Proceedings of the 2000 International Workshop on Parallel Processing
Performance analysis of different checkpointing and recovery schemes using stochastic model
Journal of Parallel and Distributed Computing
Self-stabilizing algorithm for checkpointing in a distributed system
Journal of Parallel and Distributed Computing
Journal of Parallel and Distributed Computing
Self-stabilizing checkpointing algorithm in ring topology
IWDC'05 Proceedings of the 7th international conference on Distributed Computing
Hi-index | 0.00 |
Checkpointing with rollback recovery is a well-known method for achieving fault-tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes do not have to take into consideration any application message dependency. The synchronization is achieved by passing control messages among the processes. Application messages are acknowledged. Each process maintains a list of unacknowledged messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., snapshot of the process) plus a list of messages that have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery algorithm, time and message complexities are both O(n).