How to recover efficiently and asynchronously when optimism fails

Venue: ICDCS '96, Proceedings of the 16th International Conference on Distributed Computing Systems
Year: 1996

Abstract

We propose a new algorithm for recovering asynchronously from failures in a distributed computation. Our algorithm is based on two novel concepts: a fault-tolerant vector clock, which maintains causality information in spite of failures, and a history mechanism, which detects orphan states and obsolete messages. These two mechanisms, together with checkpointing and message logging, are used to restore the system to a consistent state after the failure of one or more processes. Our algorithm is completely asynchronous. It handles multiple failures, does not assume any message ordering, causes the minimum amount of rollback, and restores the maximum recoverable state with low overhead. Earlier optimistic protocols lack one or more of these properties.
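
The abstract does not spell out how the two mechanisms are built, so the following is only a minimal sketch of one plausible reading, not the paper's algorithm. It assumes that each clock entry is a (version, timestamp) pair, that a restarting process bumps its version, and that the history is a set of failure announcements giving the last timestamp restored for a failed version. The names `FTVectorClock`, `happened_before`, `is_orphan`, and `announcements` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed entry layout: (version, timestamp). Tuples compare
# lexicographically, so a higher version always dominates.
Entry = Tuple[int, int]


class FTVectorClock:
    """Illustrative vector clock whose entries survive process restarts."""

    def __init__(self, n: int, pid: int) -> None:
        self.pid = pid
        self.entries: List[Entry] = [(0, 0) for _ in range(n)]

    def tick(self) -> None:
        # Local event or send: advance only the owner's timestamp.
        version, ts = self.entries[self.pid]
        self.entries[self.pid] = (version, ts + 1)

    def snapshot(self) -> List[Entry]:
        # Copy of the clock, e.g. to piggyback on an outgoing message.
        return list(self.entries)

    def merge(self, received: List[Entry]) -> None:
        # Receive: componentwise maximum, then count the receive event.
        self.entries = [max(a, b) for a, b in zip(self.entries, received)]
        self.tick()

    def restart(self) -> None:
        # Assumption: a restart after failure starts a new version at 0.
        version, _ = self.entries[self.pid]
        self.entries[self.pid] = (version + 1, 0)


def happened_before(a: List[Entry], b: List[Entry]) -> bool:
    # Standard vector-clock order applied to (version, timestamp) pairs.
    return a != b and all(x <= y for x, y in zip(a, b))


def is_orphan(clock: List[Entry],
              announcements: Dict[int, List[Entry]]) -> bool:
    # announcements[pid] holds (failed_version, restored_timestamp) pairs.
    # A state is treated as an orphan if it depends on a state of the
    # failed version that is later than the one actually restored.
    for pid, restored in announcements.items():
        version, ts = clock[pid]
        for failed_version, restored_ts in restored:
            if version == failed_version and ts > restored_ts:
                return True
    return False
```

In this sketch, a process that has merged a clock carrying entry (0, 2) for process 0 becomes an orphan once process 0 announces that version 0 was restored only up to timestamp 1, while a state that depends only on (0, 1) remains valid. Because the check uses only locally stored announcements, it needs no coordination between processes, which is the sense of "asynchronous" the sketch is meant to illustrate.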