Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Efficient algorithms for crash recovery in distributed systems
FST and TC 10 Proceedings of the tenth conference on Foundations of software technology and theoretical computer science
Fault tolerance in distributed systems
Fault tolerance in distributed systems
Consistent global checkpoints based on direct dependency tracking
Information Processing Letters
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Optimistic Crash Recovery without Changing Application Messages
IEEE Transactions on Parallel and Distributed Systems
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints
IEEE Transactions on Computers
On Coordinated Checkpointing in Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems
IEEE Transactions on Parallel and Distributed Systems
Advanced Concepts in Operating Systems
Advanced Concepts in Operating Systems
Asynchronous recovery without using vector timestamps
Journal of Parallel and Distributed Computing
Publishing: a reliable broadcast communication mechanism
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
How to recover efficiently and asynchronously when optimism fails
ICDCS '96 Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
We have addressed the complex problem of recovery for concurrent failures in distributed computing environment. We have proposed a new approach in which we have effectively dealt with both orphan and lost messages. The proposed checkpointing and recovery approaches enable each process to restart from its recent checkpoint and hence guarantee the least amount of recomputation after recovery. It also means that a process needs to save only its recent local checkpoint. In this regard, we have introduced two new ideas. First, the proposed value of the common checkpointing interval is such that it enables an initiator process to log the minimum number of messages sent by each application process. Second, the determination of the lost messages is always done a priori by an initiator process; besides it is done while the normal distributed application is running. This is quite meaningful because it does not delay the recovery approach in any way.