Recovering device drivers

Authors:
Michael M. Swift;Muthukaruppan Annamalai;Brian N. Bershad;Henry M. Levy
Affiliations:
University of Washington, Seattle, WA;University of Washington, Seattle, WA;University of Washington, Seattle, WA;University of Washington, Seattle, WA
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2006

Citing 27
Cited 19

Fault tolerance under UNIX

ACM Transactions on Computer Systems (TOCS)
Efficient software-based fault isolation

SOSP '93 Proceedings of the fourteenth ACM symposium on Operating systems principles
On micro-kernel construction

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Exokernel: an operating system architecture for application-level resource management

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Hypervisor-based fault tolerance

ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
The Rio file cache: surviving operating system crashes

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Dealing with disaster: surviving misbehaved kernel extensions

OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
The Flux OSKit: a substrate for kernel and language research

Proceedings of the sixteenth ACM symposium on Operating systems principles
Self-paging in the Nemesis operating system

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Integrating segmentation and paging protection for safe, efficient and transparent software extensions

Proceedings of the seventeenth ACM symposium on Operating systems principles
An empirical study of operating systems errors

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Transaction Processing: Concepts and Techniques

Transaction Processing: Concepts and Techniques
Software's Invisible Users

IEEE Software
Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software

DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Generation of an error set that emulates software faults based on field data

FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
TFT: A Software System for Application-Transparent Fault Tolerance

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
How Fail-Stop are Faulty Programs?

FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
The Systematic Improvement of Fault Tolerance in the Rio File Cache

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
A NonStop kernel

SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
Reliable hardware-software architecture

Proceedings of the international conference on Reliable software
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel

HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
Recovery Guarantees for General Multi-Tier Applications

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,

Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Making a Case for Efficient Supercomputing

Queue - Power Management
Improving the reliability of commodity operating systems

ACM Transactions on Computer Systems (TOCS)
Exploring failure transparency and the limits of generic recovery

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Libckpt: transparent checkpointing under Unix

TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings

Tool support for new test criteria on embedded systems: Justitia

Proceedings of the 2nd international conference on Ubiquitous information management and communication
Live migration of direct-access devices

ACM SIGOPS Operating Systems Review
Fast byte-granularity software fault isolation

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Tolerating hardware device failures in software

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Surviving sensor network software faults

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
The cake is a lie: privilege rings as a policy resource

Proceedings of the 1st ACM workshop on Virtual machine security
Reverse engineering of binary device drivers with RevNIC

Proceedings of the 5th European conference on Computer systems
Otherworld: giving applications a chance to survive OS kernel crashes

Proceedings of the 5th European conference on Computer systems
Design of fault tolerant system based on runtime behavior tracing

ICACT'10 Proceedings of the 12th international conference on Advanced communication technology
Reverse-engineering drivers for safety and portability

HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Device driver safety through a reference validation mechanism

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Tolerating malicious device drivers in Linux

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Testing closed-source binary device drivers with DDT

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Live migration of direct-access devices

WIOV'08 Proceedings of the First conference on I/O virtualization
We crashed, now what?

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Faults in Linux: ten years later

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
A comparative experimental study of software rejuvenation overhead

Performance Evaluation
Comprehending performance from real-world execution traces: a device-driver case

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Guardrail: a high fidelity approach to protecting hardware devices from buggy drivers

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Abstract

This article presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. Earlier work has shown that an operating system can survive driver failures [Swift et al. 2005], but the applications that depend on them cannot. Thus, while operating system reliability was greatly improved, application reliability generally was not.

To remedy this situation, we introduce a new operating system mechanism called a shadow driver. A shadow driver monitors device drivers and transparently recovers from driver failures. Moreover, it assumes the role of the failed driver during recovery. In this way, applications using the failed driver, as well as the kernel itself, continue to function as expected.

We implemented shadow drivers for the Linux operating system and tested them on over a dozen device drivers. Our results show that applications and the OS can indeed survive the failure of a variety of device drivers. Moreover, shadow drivers impose minimal performance overhead. Lastly, they can be introduced with only modest changes to the OS kernel and with no changes at all to existing device drivers.