Exploring failure transparency and the limits of generic recovery

  • Authors:
  • David E. Lowell; Subhachandra Chandra; Peter M. Chen

  • Affiliations:
  • Western Research Laboratory, Compaq Computer Corporation; Department of Electrical Engineering and Computer Science, University of Michigan; Department of Electrical Engineering and Computer Science, University of Michigan

  • Venue:
  • OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
  • Year:
  • 2000

Abstract

We explore the abstraction of failure transparency in which the operating system provides the illusion of failure-free operation. To provide failure transparency, an operating system must recover applications after hardware, operating system, and application failures, and must do so without help from the programmer or unduly slowing failure-free performance. We describe two invariants that must be upheld to provide failure transparency: one that ensures sufficient application state is saved to guarantee the user cannot discern failures, and another that ensures sufficient application state is lost to allow recovery from failures affecting application state. We find that several real applications get failure transparency in the presence of simple stop failures with overhead of 0-12%. Less encouragingly, we find that applications violate one invariant in the course of upholding the other for more than 90% of application faults and 3-15% of operating system faults, rendering transparent recovery impossible for these cases.