The Impact of Recovery Mechanisms on the Likelihood of Saving Corrupted State

Authors:
Subhachandra Chandra; Peter M. Chen
Affiliations:
-;-
Venue:
ISSRE '02 Proceedings of the 13th International Symposium on Software Reliability Engineering
Year:
2002

Citing 0
Cited 8

Rx: treating bugs as allergies---a safe method to survive software failures

Proceedings of the twentieth ACM symposium on Operating systems principles
Rx: Treating bugs as allergies—a safe method to survive software failures

ACM Transactions on Computer Systems (TOCS)
Switchblade: enforcing dynamic personalized system call models

Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Remus: high availability via asynchronous virtual machine replication

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Otherworld: giving applications a chance to survive OS kernel crashes

Proceedings of the 5th European conference on Computer systems
"Otherworld": giving applications a chance to survive OS kernel crashes

HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Flikker: saving DRAM refresh-power through critical data partitioning

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Exception handling in the Choices operating system

Advanced Topics in Exception Handling Techniques

Abstract

Recovery systems must save state before a failure occurs to enable the system to recover from the failure. However, recovery will fail if the recovery system saves any state corrupted by the fault. The frequency and comprehensiveness of how a recovery system saves state has a major effect on how often the recovery system inadvertently saves corrupted state. This paper explores and measures that effect. We measure how often software faults in the application and operating system cause real applications to save corrupted state when using different types of recovery systems. We find that generic recovery techniques, such as checkpointing and logging, work well for faults in the operating system. However, we find that they do not work well for faults in the application because the very actions taken to enable recovery often corrupt the state upon which successful recovery depends.