We propose a new algorithm for recovering asynchronously from failures in a distributed computation. Our algorithm is based on two novel concepts: a fault-tolerant vector clock, which maintains causality information in spite of failures, and a history mechanism, which detects orphan states and obsolete messages. Together with checkpointing and message logging, these two mechanisms restore the system to a consistent state after one or more processes fail. Our algorithm is completely asynchronous. It handles multiple failures, does not assume any message ordering, causes the minimum amount of rollback, and restores the maximum recoverable state with low overhead. Earlier optimistic protocols lack one or more of these properties.
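The abstract does not spell out either mechanism, so the following is only a minimal sketch of one plausible reading, not the paper's actual algorithm. It assumes each vector clock entry is a (version, timestamp) pair, that a process bumps its version on restart, and that a failure announcement records the last timestamp restored for the failed version. The names `FTVectorClock`, `is_orphan`, and `failures` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

# Assumed encoding: one (version, timestamp) pair per process.
Entry = Tuple[int, int]


class FTVectorClock:
    """Hypothetical vector clock whose entries survive restarts via a version number."""

    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.entries: List[Entry] = [(0, 0)] * n

    def tick(self) -> None:
        # Local event: advance our own timestamp within the current version.
        version, timestamp = self.entries[self.pid]
        self.entries[self.pid] = (version, timestamp + 1)

    def merge(self, received: List[Entry]) -> None:
        # Receive event: componentwise maximum, comparing (version, timestamp)
        # lexicographically, then count the receive as a local event.
        self.entries = [max(a, b) for a, b in zip(self.entries, received)]
        self.tick()

    def restart(self) -> None:
        # After a failure: start a new version so that post-recovery states
        # are never confused with lost pre-failure states.
        version, _ = self.entries[self.pid]
        self.entries[self.pid] = (version + 1, 0)


def is_orphan(entries: List[Entry], failures: Dict[Tuple[int, int], int]) -> bool:
    """Return True if `entries` depends on a state that a failure wiped out.

    `failures` maps (pid, failed_version) to the last timestamp of that
    version that was restored (an assumed form of the history information).
    """
    for pid, (version, timestamp) in enumerate(entries):
        restored = failures.get((pid, version))
        if restored is not None and timestamp > restored:
            return True
    return False


# Usage: process 0 sends a message after two events, then fails and can only
# restore up to timestamp 1 of version 0.
p0, p1 = FTVectorClock(0, 2), FTVectorClock(1, 2)
p0.tick()
p0.tick()
p1.merge(list(p0.entries))      # p1 now depends on p0's (0, 2)
failures = {(0, 0): 1}          # p0 recovered only up to (0, 1)
p0.restart()                    # p0 continues as version 1
```

Under these assumptions, `is_orphan(p1.entries, failures)` flags process 1 as an orphan because it depends on a lost state of process 0, while process 0's own post-restart clock is not flagged. The same check applied to a message's attached clock would mark the message as obsolete.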