The Systematic Improvement of Fault Tolerance in the Rio File Cache

Authors:
Wee Teck Ng; Peter M. Chen
Venue:
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Year:
1999

Cited by (16):

The Design and Verification of the Rio File Cache
IEEE Transactions on Computers

Improving the reliability of commodity operating systems
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles

Improving the reliability of commodity operating systems
ACM Transactions on Computer Systems (TOCS)

Recovering device drivers
ACM Transactions on Computer Systems (TOCS)

Emulation of Software Faults: A Field Data Study and a Practical Approach
IEEE Transactions on Software Engineering

Experiences in measuring the reliability of a cache-based storage system
WIESS'00 Proceedings of the 1st conference on Industrial Experiences with Systems Software - Volume 1

Towards availability benchmarks: a case study of software RAID systems
ATEC '00 Proceedings of the annual conference on USENIX Annual Technical Conference

SafeDrive: safe and recoverable extensions using language-based techniques
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation

Otherworld: giving applications a chance to survive OS kernel crashes
Proceedings of the 5th European conference on Computer systems

CuriOS: improving reliability through operating system structure
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Device driver safety through a reference validation mechanism
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

ReHype: enabling VM survival across hypervisor failures
Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments

Efficient Testing of Recovery Code Using Fault Injection
ACM Transactions on Computer Systems (TOCS)

Compiler support for fine-grain software-only checkpointing
CC'12 Proceedings of the 21st international conference on Compiler Construction

Is Linux kernel oops useful or not?
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability

Back to the future: fault-tolerant live update with time-traveling state transfer
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration

Abstract

Fault injection is typically used to characterize failures and to validate and compare fault-tolerant mechanisms. However, fault injection is rarely used for all these purposes to guide the design and implementation of a fault-tolerant system. We present a systematic and quantitative approach for using software-implemented fault injection to guide the design and implementation of a fault-tolerant system. Our system design goal is to build a write-back file cache on Intel PCs that is as reliable as a write-through file cache. We follow an iterative approach to improve robustness in the presence of operating system errors. In each iteration, we measure the reliability of the system, analyze the fault symptoms that lead to data corruption, and apply fault-tolerant mechanisms that address the fault symptoms. Our initial system is 13 times less reliable than a write-through file cache. The result of several iterations is a design that is both more reliable (1.9% vs. 3.1% corruption rate) and 5-9 times as fast as a write-through file cache.
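
The measure, analyze, and mitigate loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' tooling: the Trial record, the run_campaign callback, the toy_campaign function, and the symptom labels are all hypothetical, and the trial counts are synthetic rather than results from the paper. Only the 3.1% target is taken from the abstract, where it is the write-through corruption rate.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Outcome of one injected fault (hypothetical record format)."""
    crashed: bool
    corrupted: bool
    symptom: str = ""  # only meaningful when corrupted is True


def corruption_rate(trials):
    """Fraction of crashes that corrupted file data."""
    crashes = [t for t in trials if t.crashed]
    if not crashes:
        return 0.0
    return sum(t.corrupted for t in crashes) / len(crashes)


def dominant_symptom(trials):
    """Most frequent symptom among corrupting crashes, or None."""
    counts = Counter(t.symptom for t in trials if t.crashed and t.corrupted)
    return counts.most_common(1)[0][0] if counts else None


def improve(run_campaign, target_rate, max_iterations=10):
    """Iterate: measure reliability, pick the dominant symptom, mitigate it.

    run_campaign(mitigated) is supplied by the caller; it runs one
    fault-injection campaign with the given symptoms mitigated and
    returns a list of Trial records.
    """
    mitigated = set()
    history = []
    for _ in range(max_iterations):
        trials = run_campaign(frozenset(mitigated))
        rate = corruption_rate(trials)
        history.append(rate)
        if rate <= target_rate:
            break
        symptom = dominant_symptom(trials)
        if symptom is None or symptom in mitigated:
            break  # nothing new to address
        mitigated.add(symptom)
    return mitigated, history


def toy_campaign(mitigated):
    """Synthetic stand-in for a real campaign; the counts are made up."""
    trials = [Trial(True, False)] * 90
    for symptom, count in (("symptom_a", 6), ("symptom_b", 3), ("symptom_c", 1)):
        corrupted = symptom not in mitigated
        trials += [Trial(True, corrupted, symptom if corrupted else "")] * count
    return trials


# Target: the write-through corruption rate quoted in the abstract (3.1%).
mitigated, history = improve(toy_campaign, target_rate=0.031)
print(history)            # [0.1, 0.04, 0.01]
print(sorted(mitigated))  # ['symptom_a', 'symptom_b']
```

In this toy run the loop stops after two mitigations because the measured rate falls below the target. A real campaign would replace toy_campaign with actual fault injection and crash analysis.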