Exploring failure transparency and the limits of generic recovery

Authors:
David E. Lowell; Subhachandra Chandra; Peter M. Chen
Affiliations:
Western Research Laboratory, Compaq Computer Corporation; Department of Electrical Engineering and Computer Science, University of Michigan; Department of Electrical Engineering and Computer Science, University of Michigan
Venue:
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Year:
2000

Abstract

We explore the abstraction of failure transparency, in which the operating system provides the illusion of failure-free operation. To provide failure transparency, an operating system must recover applications after hardware, operating system, and application failures, and must do so without help from the programmer and without unduly slowing failure-free performance. We describe two invariants that must be upheld to provide failure transparency: one that ensures sufficient application state is saved to guarantee the user cannot discern failures, and another that ensures sufficient application state is lost to allow recovery from failures affecting application state. We find that several real applications get failure transparency in the presence of simple stop failures with overhead of 0-12%. Less encouragingly, we find that applications violate one invariant in the course of upholding the other for more than 90% of application faults and 3-15% of operating system faults, rendering transparent recovery impossible for these cases.