Completely Asynchronous Optimistic Recovery with Minimal Rollbacks

Authors:
Sean W. Smith;David B. Johnson;J. D. Tygar
Affiliations:
-;-;-
Venue:
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Year:
1995

Citing 16
Cited 22

Optimistic recovery in distributed systems

ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems

IEEE Transactions on Software Engineering - Special issue on distributed systems
Fault tolerance under UNIX

ACM Transactions on Computer Systems (TOCS)
Efficient distributed recovery using message logging

Proceedings of the eighth annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using optimistic message logging and check-pointing

Journal of Algorithms
Real-time, concurrent checkpoint for parallel programs

PPOPP '90 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming
Lightweight causal and atomic group multicast

ACM Transactions on Computer Systems (TOCS)
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

IEEE Transactions on Computers - Special issue on fault-tolerant computing
Distributed snapshots: determining global states of distributed systems

ACM Transactions on Computer Systems (TOCS)
Concurrent Robust Checkpointing and Recovery in Distributed Systems

Proceedings of the Fourth International Conference on Data Engineering
A message system supporting fault tolerance

SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
Publishing: a reliable broadcast communication mechanism

SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
A Theory of Distributed Time

A Theory of Distributed Time
Signed Vector Timestamps: A Secure Protocol for Partial Order Time

Signed Vector Timestamps: A Secure Protocol for Partial Order Time
Distributed system fault tolerance using message logging and checkpointing

Distributed system fault tolerance using message logging and checkpointing
State Restoration in Systems of Communicating Processes

IEEE Transactions on Software Engineering

Atomicity in electronic commerce

PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Atomicity in electronic commerce

netWorker
Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations

The Journal of Supercomputing
Easing the management of data-parallel systems via adaptation

EW 9 Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system
Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing

The Journal of Supercomputing
Asynchronous recovery without using vector timestamps

Journal of Parallel and Distributed Computing
Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments

ICPP '97 Proceedings of the international Conference on Parallel Processing
SAM: A Flexible and Secure Auction Architecture Using Trusted Hardware

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing in Message Passing Systems

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Atomicity versus Anonymity: Distributed Transactions for Electronic Commerce

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Checkpoint-Recovery for Mobile Intelligent Networks

Proceedings of the 14th International conference on Industrial and engineering applications of artificial intelligence and expert systems: engineering of intelligent systems
An Efficient Optimistic Message Logging Scheme for Recoverable Mobile Computing Systems

IEEE Transactions on Mobile Computing
Supporting fault-tolerance in heterogeneous distributed applications

HCW '97 Proceedings of the 6th Heterogeneous Computing Workshop (HCW '97)
Minimizing timestamp size for completely asynchronous optimistic recovery with minimal rollback

SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Distributed recovery with K-optimistic logging

Journal of Parallel and Distributed Computing
A causal message logging protocol for mobile nodes in mobile computing systems

Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
Flashback: a lightweight extension for rollback and deterministic replay for software debugging

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Secure coprocessors in electronic commerce applications

WOEC'95 Proceedings of the 1st conference on USENIX Workshop on Electronic Commerce - Volume 1
Active Optimistic Message Logging for Reliable Execution of MPI Applications

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Message fragment based causal message logging

Journal of Parallel and Distributed Computing
Garbage collection in a causal message logging protocol

HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
Mobile agent based fault-tolerance support for the reliable mobile computing systems

COORDINATION'05 Proceedings of the 7th international conference on Coordination Models and Languages

Quantified Score

Hi-index	0.00

Visualization

Abstract

Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However, previous optimistic rollback recovery protocols either have required synchronization during recovery, or have permitted a failure at one process to potentially trigger an exponential number of process rollbacks. In this paper, we present an optimistic rollback recovery protocol that provides completely asynchronous recovery, while also reducing the number of times a process must roll back in response to a failure to at most one. This protocol is based on comparing timestamp vectors across multiple levels of partial order time.