Damage Assessment for Optimal Rollback Recovery

Authors: Tein-Hsiang Lin; Kang G. Shin
Affiliations: Microtec/Mentor Graphics, Santa Clara, CA; Univ. of Michigan, Ann Arbor
Venue: IEEE Transactions on Computers
Year: 1998

Abstract

Conventional schemes of rollback recovery with checkpointing for concurrent processes have overlooked an important problem: contamination of checkpoints as a result of error propagation among the cooperating processes. Error propagation is unavoidable due to imperfect detection mechanisms and random interprocess communications, and it could give rise to contaminated checkpoints which, in turn, result in unsuccessful rollbacks. To counter the problem of error propagation, a damage assessment model is developed to estimate the correctness of saved checkpoints under various circumstances. Using the result of damage assessment, determination of the "optimal" checkpoints for rollback recovery (those which minimize the average total recovery overhead) is formulated and solved as a nonlinear integer programming problem. Integration of damage assessment into existing recovery schemes is also discussed.