ACM Transactions on Computer Systems (TOCS)
Efficient software-based fault isolation
SOSP '93 Proceedings of the fourteenth ACM symposium on Operating systems principles
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Exokernel: an operating system architecture for application-level resource management
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Hypervisor-based fault tolerance
ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
The Rio file cache: surviving operating system crashes
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Dealing with disaster: surviving misbehaved kernel extensions
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
The Flux OSKit: a substrate for kernel and language research
Proceedings of the sixteenth ACM symposium on Operating systems principles
Self-paging in the Nemesis operating system
OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Proceedings of the seventeenth ACM symposium on Operating systems principles
An empirical study of operating systems errors
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Transaction Processing: Concepts and Techniques
Transaction Processing: Concepts and Techniques
IEEE Software
Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Generation of an error set that emulates software faults based on field data
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
TFT: A Software System for Application-Transparent Fault Tolerance
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
How Fail-Stop are Faulty Programs?
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
The Systematic Improvement of Fault Tolerance in the Rio File Cache
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
Reliable hardware-software architecture
Proceedings of the international conference on Reliable software
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel
HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
Recovery Guarantees for General Multi-Tier Applications
ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Making a Case for Efficient Supercomputing
Queue - Power Management
Improving the reliability of commodity operating systems
ACM Transactions on Computer Systems (TOCS)
Exploring failure transparency and the limits of generic recovery
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
Tool support for new test criteria on embedded systems: Justitia
Proceedings of the 2nd international conference on Ubiquitous information management and communication
Live migration of direct-access devices
ACM SIGOPS Operating Systems Review
Fast byte-granularity software fault isolation
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Tolerating hardware device failures in software
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Surviving sensor network software faults
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
The cake is a lie: privilege rings as a policy resource
Proceedings of the 1st ACM workshop on Virtual machine security
Reverse engineering of binary device drivers with RevNIC
Proceedings of the 5th European conference on Computer systems
Otherworld: giving applications a chance to survive OS kernel crashes
Proceedings of the 5th European conference on Computer systems
Design of fault tolerant system based on runtime behavior tracing
ICACT'10 Proceedings of the 12th international conference on Advanced communication technology
Reverse-engineering drivers for safety and portability
HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Device driver safety through a reference validation mechanism
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Tolerating malicious device drivers in Linux
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Testing closed-source binary device drivers with DDT
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Live migration of direct-access devices
WIOV'08 Proceedings of the First conference on I/O virtualization
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Faults in linux: ten years later
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
A comparative experimental study of software rejuvenation overhead
Performance Evaluation
Comprehending performance from real-world execution traces: a device-driver case
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Guardrail: a high fidelity approach to protecting hardware devices from buggy drivers
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
This article presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. Earlier work has shown that an operating system can survive driver failures [Swift et al. 2005], but the applications that depend on them cannot. Thus, while operating system reliability was greatly improved, application reliability generally was not.To remedy this situation, we introduce a new operating system mechanism called a shadow driver. A shadow driver monitors device drivers and transparently recovers from driver failures. Moreover, it assumes the role of the failed driver during recovery. In this way, applications using the failed driver, as well as the kernel itself, continue to function as expected.We implemented shadow drivers for the Linux operating system and tested them on over a dozen device drivers. Our results show that applications and the OS can indeed survive the failure of a variety of device drivers. Moreover, shadow drivers impose minimal performance overhead. Lastly, they can be introduced with only modest changes to the OS kernel and with no changes at all to existing device drivers.