The Systematic Improvement of Fault Tolerance in the Rio File Cache

  • Authors:
  • Wee Teck Ng; Peter M. Chen

  • Venue:
  • FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
  • Year:
  • 1999

Abstract

Fault injection is typically used to characterize failures and to validate and compare fault-tolerant mechanisms. However, fault injection is rarely used for all these purposes to guide the design and implementation of a fault-tolerant system. We present a systematic and quantitative approach for using software-implemented fault injection to guide the design and implementation of a fault-tolerant system. Our system design goal is to build a write-back file cache on Intel PCs that is as reliable as a write-through file cache. We follow an iterative approach to improve robustness in the presence of operating system errors. In each iteration, we measure the reliability of the system, analyze the fault symptoms that lead to data corruption, and apply fault-tolerant mechanisms that address the fault symptoms. Our initial system is 13 times less reliable than a write-through file cache. The result of several iterations is a design that is both more reliable (1.9% vs. 3.1% corruption rate) and 5-9 times as fast as a write-through file cache.
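The measure, analyze, apply loop described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' tooling: the symptom names, their per-fault corruption rates, and the tenfold benefit of a mitigation are all hypothetical. Only the 3.1% write-through baseline is taken from the abstract.

```python
import random

# Hypothetical symptom names and corruption rates; none come from the paper.
SYMPTOM_RATES = {"stray_write": 0.20, "bad_pointer": 0.12, "lost_flush": 0.08}
BASELINE_RATE = 0.031  # write-through corruption rate quoted in the abstract


def run_campaign(mitigated, trials, rng):
    """Inject one simulated fault per trial; count corruptions by symptom."""
    counts = {name: 0 for name in SYMPTOM_RATES}
    for _ in range(trials):
        symptom = rng.choice(sorted(SYMPTOM_RATES))
        # Assumption: a mitigated symptom corrupts data 10x less often.
        rate = SYMPTOM_RATES[symptom] * (0.1 if symptom in mitigated else 1.0)
        if rng.random() < rate:
            counts[symptom] += 1
    return counts


def improve(trials=20000, max_iterations=10, seed=0):
    """Iterate until the measured rate is no worse than the baseline."""
    rng = random.Random(seed)
    mitigated = set()
    history = []
    for _ in range(max_iterations):
        # Step 1: measure reliability with a fault-injection campaign.
        counts = run_campaign(mitigated, trials, rng)
        rate = sum(counts.values()) / trials
        history.append(rate)
        if rate <= BASELINE_RATE:
            break
        # Step 2: analyze which unmitigated symptom corrupts data most often.
        candidates = {s: c for s, c in counts.items() if s not in mitigated}
        if not candidates:
            break
        worst = max(candidates, key=candidates.get)
        # Step 3: apply a mechanism that addresses that symptom.
        mitigated.add(worst)
    return history, mitigated


if __name__ == "__main__":
    history, mitigated = improve()
    for i, rate in enumerate(history, start=1):
        print(f"iteration {i}: corruption rate {rate:.2%}")
    print("mitigated symptoms:", sorted(mitigated))
```

Each pass through the loop records the measured corruption rate, so the returned history shows how far each added mechanism moved the system toward the write-through baseline. In a real study the simulated campaign would be replaced by actual fault-injection runs against the file cache.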