As the number of processors in a distributed system and the running time of applications increase, so does the likelihood of processor failure. The failure of even a single processor can force an entire application to restart from scratch. With appropriate recovery mechanisms, distributed applications can survive multiple failures and avoid complete restarts. The authors present a set of distributed recovery techniques called CPR (Complete Process Recovery) that use vector time to handle failures and address both consistent-state restoration and the associated message-handling issues. The latter is important because some recovery techniques delegate the handling of lost or duplicate messages to the message-transport mechanism. That delegation requires the ability to checkpoint the network layer (a serious restriction), because the network is generally unaware of the anomalous messages induced by process failure and recovery. CPR requires nonfailed processes to roll back at most once in response to a single failure, and it has low message complexity. Processes required to roll back after a failure do so concurrently, which substantially decreases recovery delay.
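
The abstract does not spell out how vector time is used, so the following is only a generic illustration and not the CPR protocol itself. It is a minimal Python sketch of textbook vector clocks, plus one plausible way a surviving process could tell that its state depends on work a failed process lost. The names `Process`, `happened_before`, and `depends_on_lost_work`, and the rule of comparing against the failed process's checkpointed counter, are assumptions made for this example.

```python
from typing import List


class Process:
    """Hypothetical process that keeps a vector clock of length n."""

    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.clock: List[int] = [0] * n

    def local_event(self) -> None:
        # Every local event advances this process's own component.
        self.clock[self.pid] += 1

    def send(self) -> List[int]:
        # Sending is an event; the message carries a copy of the clock.
        self.clock[self.pid] += 1
        return list(self.clock)

    def receive(self, stamp: List[int]) -> None:
        # Merge by component-wise maximum, then count the receive event.
        self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
        self.clock[self.pid] += 1


def happened_before(a: List[int], b: List[int]) -> bool:
    """True if timestamp a causally precedes timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def depends_on_lost_work(state: List[int], failed_pid: int,
                         restored_count: int) -> bool:
    """Assumed rule: a state is suspect if it has seen more events of the
    failed process than that process's restored checkpoint contains."""
    return state[failed_pid] > restored_count


# Three processes; p0 checkpoints, sends to p1, and then fails.
p0, p1, p2 = Process(0, 3), Process(1, 3), Process(2, 3)
p0.local_event()
checkpoint = list(p0.clock)      # p0's state saved at its checkpoint
p1.receive(p0.send())            # p1 now depends on p0's post-checkpoint send
p2.local_event()                 # p2 never heard from p0

must_roll_back = [
    p.pid for p in (p1, p2)
    if depends_on_lost_work(p.clock, 0, checkpoint[0])
]
```

In this scenario only `p1` has observed an event of `p0` that the checkpoint does not cover, so it is the only survivor flagged for rollback. Each survivor can evaluate the test locally from its own clock, which is in the spirit of the concurrent rollback the abstract describes, though the actual CPR rules may differ.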