Distributed recovery with K-optimistic logging

Authors:
Om P. Damani;Yi-Min Wang;Vijay K. Garg
Affiliations:
IBM T.J. Watson Research Center, Hawthorne, NY;Microsoft Research, Redmond, WA;Department of Elect. and Computer Engineering, University of Texas at Austin
Venue:
Journal of Parallel and Distributed Computing
Year:
2003

Abstract

Fault-tolerance techniques based on checkpointing and message logging have been increasingly used in real-world applications to reduce service downtime. Most industrial applications have chosen pessimistic logging because it allows fast and localized recovery. The price they pay, however, is high failure-free overhead. In this paper, we introduce the concept of K-optimistic logging, where K is the degree of optimism that can be used to fine-tune the trade-off between failure-free overhead and recovery efficiency. Traditional pessimistic logging and optimistic logging then become the two extremes of the spectrum spanned by K-optimistic logging. Our results generalize several previously known protocols.

Our approach is to prove that only dependencies on those states that may be lost upon a failure need to be tracked on-line, and so transitive dependency tracking can be performed with a variable-size vector. The size of the vector piggybacked on a message then indicates the number of processes whose failures may revoke the message, and K corresponds to the upper bound on the vector size. Furthermore, the parameter K is dynamically tunable in response to changing system characteristics.
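The idea of a variable-size dependency vector bounded by K can be illustrated with a small sketch. This is not the authors' protocol: the class `Process`, its methods, and the `on_stable` notification are hypothetical names chosen for illustration, and incarnation numbers, rollback, and failure announcements are left out. The sketch assumes that a process learns (by some unspecified means) when a state interval of another process has been logged to stable storage, at which point the corresponding vector entry can be dropped. A message is released only once its piggybacked vector has at most K entries.

```python
class Process:
    """Toy model of K-bounded dependency tracking (illustrative assumption only)."""

    def __init__(self, pid, k):
        self.pid = pid
        self.k = k          # upper bound on the piggybacked vector size
        self.index = 0      # index of the current local state interval
        self.deps = {}      # pid -> highest state index depended on that may still be lost
        self.held = []      # (payload, vector) pairs waiting for their vector to shrink

    def local_event(self):
        # Start a new state interval; it is unstable until logged.
        self.index += 1
        self.deps[self.pid] = self.index

    def receive(self, vector):
        # Transitive dependency tracking: merge the sender's entries by maximum.
        for pid, idx in vector.items():
            self.deps[pid] = max(idx, self.deps.get(pid, 0))
        self.local_event()

    def send(self, payload):
        # Queue the message with a snapshot of the current dependencies.
        self.held.append((payload, dict(self.deps)))
        return self.release()

    def on_stable(self, pid, idx):
        # States of `pid` up to `idx` can no longer be lost: drop covered entries.
        for vec in [self.deps] + [v for _, v in self.held]:
            if pid in vec and vec[pid] <= idx:
                del vec[pid]
        return self.release()

    def release(self):
        # A message may leave only if at most k process failures could revoke it.
        ready = [m for m in self.held if len(m[1]) <= self.k]
        self.held = [m for m in self.held if len(m[1]) > self.k]
        return ready


# Usage: with k=1 a message depending on two unstable processes is held
# until one of those dependencies becomes stable.
p = Process(pid=1, k=1)
p.local_event()
p.receive({2: 5})
held_result = p.send("m")
released = p.on_stable(2, 5)
```

In this model, `k=0` behaves like pessimistic logging (a message is released only when it depends on no state that can be lost), while setting `k` to the number of processes never holds a message, which corresponds to traditional optimistic logging.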