Conventional schemes of rollback recovery with checkpointing for concurrent processes have overlooked an important problem: contamination of checkpoints as a result of error propagation among the cooperating processes. Error propagation is unavoidable due to imperfect detection mechanisms and random interprocess communications, and it could give rise to contaminated checkpoints which, in turn, result in unsuccessful rollbacks. To counter the problem of error propagation, a damage assessment model is developed to estimate the correctness of saved checkpoints under various circumstances. Using the result of damage assessment, determination of the "optimal" checkpoints for rollback recovery, which minimize the average total recovery overhead, is formulated and solved as a nonlinear integer programming problem. Integration of damage assessment into existing recovery schemes is also discussed.
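The trade-off behind choosing a rollback checkpoint can be illustrated with a toy calculation. The sketch below is not the paper's damage assessment model or its nonlinear integer program; it assumes a much simpler, hypothetical setting: each checkpoint has a known recomputation cost and an estimated probability of being uncontaminated, the oldest checkpoint is taken to be clean, and a failed rollback is followed by a single fallback to that oldest checkpoint. The names `expected_overhead`, `best_checkpoint`, `cost`, and `p_clean` are introduced here for illustration only.

```python
from typing import Sequence, Tuple


def expected_overhead(k: int, cost: Sequence[float], p_clean: Sequence[float]) -> float:
    """Expected recovery overhead when checkpoint k is tried first.

    Toy assumption: if checkpoint k turns out to be contaminated, the
    rollback fails and the system falls back to the oldest checkpoint
    (index -1), which is assumed to be clean.
    """
    return cost[k] + (1.0 - p_clean[k]) * cost[-1]


def best_checkpoint(cost: Sequence[float], p_clean: Sequence[float]) -> Tuple[int, float]:
    """Enumerate every checkpoint index and return (index, expected overhead)
    for the one with the smallest expected overhead under the toy model."""
    if not cost or len(cost) != len(p_clean):
        raise ValueError("cost and p_clean must be non-empty and of equal length")
    candidates = ((k, expected_overhead(k, cost, p_clean)) for k in range(len(cost)))
    return min(candidates, key=lambda pair: pair[1])


# Checkpoints ordered from most recent (index 0) to oldest (index 3).
# Hypothetical values: older checkpoints cost more to roll back to,
# but are more likely to predate the error and so to be clean.
cost = [1.0, 2.0, 4.0, 10.0]
p_clean = [0.5, 0.8, 0.95, 1.0]

index, overhead = best_checkpoint(cost, p_clean)
print(index, round(overhead, 3))
```

With these made-up numbers, the most recent checkpoint is cheap but fails half the time, while the oldest is always safe but expensive, so an intermediate checkpoint gives the lowest expected overhead. Exhaustive enumeration is only practical because the toy problem has a single integer decision variable.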