Fault injection is typically used to characterize failures and to validate and compare fault-tolerant mechanisms. However, these uses are rarely combined to guide the design and implementation of a fault-tolerant system. We present a systematic and quantitative approach for using software-implemented fault injection to do so. Our design goal is to build a write-back file cache on Intel PCs that is as reliable as a write-through file cache. We follow an iterative approach to improve robustness in the presence of operating system errors. In each iteration, we measure the reliability of the system, analyze the fault symptoms that lead to data corruption, and apply fault-tolerant mechanisms that address those symptoms. Our initial system is 13 times less reliable than a write-through file cache. After several iterations, the design is both more reliable than a write-through file cache (1.9% vs. 3.1% corruption rate) and 5-9 times as fast.
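The measure/analyze/apply iteration described above can be sketched roughly as follows. This is only an illustrative outline, not the authors' tooling: the function names, the shape of an outcome record (a dict with "symptom" and "corrupted" keys), and the mapping from symptoms to mechanisms are all assumptions made for the example.

```python
from collections import Counter


def corruption_rate(outcomes):
    """Fraction of fault-injection runs that ended with corrupted file data."""
    if not outcomes:
        raise ValueError("a campaign must contain at least one run")
    return sum(1 for o in outcomes if o["corrupted"]) / len(outcomes)


def dominant_symptom(outcomes):
    """Most frequent fault symptom among the runs that corrupted data."""
    symptoms = Counter(o["symptom"] for o in outcomes if o["corrupted"])
    return symptoms.most_common(1)[0][0] if symptoms else None


def improve(run_campaign, mechanisms, target_rate, max_iterations=10):
    """Repeat measure -> analyze -> apply until the target rate is reached.

    run_campaign(enabled) is a hypothetical callback that runs one
    fault-injection campaign with the given set of mechanisms enabled and
    returns a list of outcome dicts. mechanisms maps a symptom name to the
    (hypothetical) mechanism that addresses it.
    """
    enabled = set()
    history = []
    for _ in range(max_iterations):
        outcomes = run_campaign(enabled)      # measure reliability
        rate = corruption_rate(outcomes)
        history.append(rate)
        if rate <= target_rate:
            break
        symptom = dominant_symptom(outcomes)  # analyze fault symptoms
        mechanism = mechanisms.get(symptom)
        if mechanism is None or mechanism in enabled:
            break                             # nothing untried addresses it
        enabled.add(mechanism)                # apply a mechanism
    return enabled, history
```

In this sketch the target rate would be the measured corruption rate of the write-through baseline, so the loop stops once the write-back design is at least as reliable as that baseline.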