A novel low-overhead recovery approach for distributed systems

Authors:
B. Gupta;S. Rahimi
Affiliations:
Computer Science Department, Southern Illinois University, Carbondale, IL;Computer Science Department, Southern Illinois University, Carbondale, IL
Venue:
Journal of Computer Systems, Networks, and Communications
Year:
2009

Abstract

We address the complex problem of recovery from concurrent failures in a distributed computing environment. We propose a new approach that handles both orphan and lost messages effectively. The proposed checkpointing and recovery approaches enable each process to restart from its most recent checkpoint and hence guarantee the least amount of recomputation after recovery. This also means that a process needs to save only its most recent local checkpoint. In this regard, we introduce two new ideas. First, the proposed value of the common checkpointing interval enables an initiator process to log the minimum number of messages sent by each application process. Second, the lost messages are always determined a priori by an initiator process, while the normal distributed application is running. This is significant because it does not delay recovery in any way.
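
To make the two message categories named in the abstract concrete, the following minimal Python sketch classifies messages as orphan or lost relative to a set of local checkpoints. It uses the standard textbook definitions (an orphan message has its receive recorded but not its send; a lost message has its send recorded but not its receive). It is not the algorithm proposed in the paper: the `Message` model, the event counters, and the `classify` helper are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """One application message in a hypothetical event-counter model."""
    sender: str
    receiver: str
    send_event: int   # position of the send in the sender's local event order
    recv_event: int   # position of the receive in the receiver's local event order


def classify(messages, checkpoint):
    """Split messages into orphan and lost lists for a given set of checkpoints.

    checkpoint[p] is the index of the last local event of process p that is
    covered by the checkpoint p would restart from (an assumed representation).
    """
    orphans, lost = [], []
    for m in messages:
        send_saved = m.send_event <= checkpoint[m.sender]
        recv_saved = m.recv_event <= checkpoint[m.receiver]
        if recv_saved and not send_saved:
            orphans.append(m)   # receive is recorded, but the send is rolled back
        elif send_saved and not recv_saved:
            lost.append(m)      # send is recorded, but the receive is rolled back
    return orphans, lost


# Two processes, each restarting from a checkpoint taken after local event 5.
messages = [
    Message("P1", "P2", send_event=2, recv_event=3),  # fully inside both checkpoints
    Message("P2", "P1", send_event=6, recv_event=4),  # orphan
    Message("P1", "P2", send_event=5, recv_event=7),  # lost
]
checkpoint = {"P1": 5, "P2": 5}
orphans, lost = classify(messages, checkpoint)
print("orphan:", orphans)
print("lost:", lost)
```

In this model, a lost message is one the sender will not regenerate after restart, so it would have to be replayed from a log of sent messages. An orphan message is one whose receive must not survive the rollback. How the paper's initiator process identifies these messages ahead of a failure is not specified in the abstract and is not modeled here.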