Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including that it requires no synchronization between processes during failure-free operation. However, previous optimistic rollback recovery protocols have either required synchronization during recovery or allowed a failure at one process to trigger an exponential number of process rollbacks. In this paper, we present an optimistic rollback recovery protocol that provides completely asynchronous recovery while also ensuring that a process rolls back at most once in response to a failure. This protocol is based on comparing timestamp vectors across multiple levels of partial order time.
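
As a rough illustration of what comparing timestamp vectors means, the sketch below shows the standard partial-order comparison on a single level of vector timestamps. It is not the protocol described above, which compares vectors across multiple levels; the function names `happened_before` and `concurrent` are hypothetical and chosen only for this example, and each vector is assumed to hold one counter per process.

```python
from typing import Sequence


def happened_before(a: Sequence[int], b: Sequence[int]) -> bool:
    """Return True if vector timestamp a strictly precedes b.

    a precedes b when every entry of a is <= the matching entry of b
    and at least one entry is strictly smaller.
    """
    if len(a) != len(b):
        raise ValueError("timestamps must have one entry per process")
    no_entry_larger = all(x <= y for x, y in zip(a, b))
    some_entry_smaller = any(x < y for x, y in zip(a, b))
    return no_entry_larger and some_entry_smaller


def concurrent(a: Sequence[int], b: Sequence[int]) -> bool:
    """Return True if neither timestamp precedes the other and they differ."""
    return (
        list(a) != list(b)
        and not happened_before(a, b)
        and not happened_before(b, a)
    )
```

Because the order is only partial, two timestamps can be incomparable: `[2, 0, 0]` and `[0, 1, 0]` are concurrent, while `[1, 0, 0]` precedes `[1, 2, 0]`.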