Recovering from node failures is a critical issue in distributed database systems. In conventional log-based recovery protocols, the nodes that provide recovery service can become overloaded, especially when recovery is resource-intensive. This paper presents an agent-based dynamic recovery protocol that divides the recovery process into three major steps: log-recovery, agent-recovery, and synchronization. The key idea is to cache new database operations initiated during recovery in agents; these cached operations are then replayed independently to complete the recovery. The analysis indicates that the new protocol can minimize inter-node dependency and improve recovery speed. As a result, the system failure rate is reduced and overall performance improves.
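The three steps named in the abstract can be illustrated with a small single-process sketch. This is not the paper's implementation: the class names `Agent` and `RecoveringNode`, the key/value operation format, and the reading of synchronization as "apply whatever was cached late, then go live" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical buffer for operations issued while a node is recovering."""
    cached: list = field(default_factory=list)

    def cache(self, key, value):
        self.cached.append((key, value))

    def drain(self):
        # Hand back everything cached so far and start with an empty buffer.
        ops, self.cached = self.cached, []
        return ops


@dataclass
class RecoveringNode:
    """Hypothetical node whose state is rebuilt in three steps."""
    log: list                                   # (key, value) writes persisted before the failure
    state: dict = field(default_factory=dict)
    agent: Agent = field(default_factory=Agent)
    recovering: bool = True

    def submit(self, key, value):
        # New operations are cached in the agent during recovery
        # and applied directly once the node is live again.
        if self.recovering:
            self.agent.cache(key, value)
        else:
            self.state[key] = value

    def log_recovery(self):
        # Step 1: rebuild the pre-failure state from the log.
        for key, value in self.log:
            self.state[key] = value

    def agent_recovery(self):
        # Step 2: replay the operations cached so far, in arrival order.
        for key, value in self.agent.drain():
            self.state[key] = value

    def synchronize(self):
        # Step 3 (assumed semantics): apply late arrivals, then go live.
        for key, value in self.agent.drain():
            self.state[key] = value
        self.recovering = False


node = RecoveringNode(log=[("x", 1), ("y", 2)])
node.submit("x", 10)      # arrives during recovery, so it is cached
node.log_recovery()
node.submit("z", 3)       # also cached
node.agent_recovery()
node.submit("y", 20)      # arrives after the replay, before going live
node.synchronize()
node.submit("w", 4)       # node is live, applied directly
```

In this toy model, the cached operations are replayed from the agent's own buffer without consulting any other node, which is the sense in which the abstract describes the replay as independent. A real protocol would also need to handle concurrency, ordering across nodes, and agent failures, none of which this sketch attempts.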