Complete Process Recovery: Using Vector Time to Handle Multiple Failures in Distributed Systems

Authors:
Golden G. Richard III; Mukesh Singhal
Venue:
IEEE Parallel & Distributed Technology: Systems & Technology
Year:
1997

Abstract

As the number of processors in a distributed system and the running time of applications increase, so does the likelihood of processor failure. The failure of even a single processor can force an entire application to restart from scratch. With appropriate recovery mechanisms, distributed applications can survive multiple failures and avoid complete restarts. The authors present a set of distributed recovery techniques called CPR (Complete Process Recovery) that use vector time to handle failures and that address both consistent-state restoration and the associated message-handling issues. The latter is important because some recovery techniques delegate the handling of lost or duplicate messages to the message-transport mechanism. Such delegation requires the ability to checkpoint the network layer (a serious restriction), because the network is generally unaware of the anomalous messages induced by process failure and recovery. CPR requires nonfailed processes to roll back at most once in response to a single failure, and it has low message complexity. Processes that must roll back after a failure do so concurrently, which substantially decreases recovery delay.
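
The abstract does not spell out CPR's algorithms, so the following is only a generic illustration of vector time and of how a timestamp can reveal dependence on a failed process's lost state. It is a minimal sketch under stated assumptions, not the authors' method: the names `VectorClock`, `happened_before`, and `depends_on_lost_state`, and the idea of comparing against a restored event count, are hypothetical and chosen for illustration.

```python
from typing import List


class VectorClock:
    """Generic vector clock for a fixed set of n processes (illustration only)."""

    def __init__(self, n: int, pid: int) -> None:
        self.pid = pid
        self.v: List[int] = [0] * n

    def tick(self) -> List[int]:
        """Advance the local component for an internal or send event."""
        self.v[self.pid] += 1
        return list(self.v)

    def send(self) -> List[int]:
        """Return the timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, ts: List[int]) -> List[int]:
        """Merge an incoming timestamp (component-wise max), then tick."""
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        return self.tick()


def happened_before(a: List[int], b: List[int]) -> bool:
    """True if timestamp a causally precedes timestamp b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def depends_on_lost_state(ts: List[int], failed_pid: int, restored_count: int) -> bool:
    """Hypothetical check: does ts reflect events of failed_pid that lie
    beyond the point to which failed_pid was restored?"""
    return ts[failed_pid] > restored_count


# Two processes: p0 takes a checkpoint after its first event, then sends to p1.
p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
checkpoint = p0.tick()        # [1, 0]
msg = p0.send()               # [2, 0]
p1_state = p1.receive(msg)    # [2, 1]

# If p0 fails and is restored to the checkpoint (event count 1), p1's state
# reflects an event of p0 that was lost, so p1 would be a rollback candidate.
needs_rollback = depends_on_lost_state(p1_state, failed_pid=0, restored_count=checkpoint[0])
```

In this sketch a single component comparison is enough to decide whether a state depends on lost events, which is the general property that makes vector time attractive for recovery. How CPR actually uses vector time to restore state and handle messages is described in the paper itself, not in this abstract.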