Choices (class hierarchical open interface for custom embedded systems)
ACM SIGOPS Operating Systems Review
The Rio file cache: surviving operating system crashes
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
An empirical study of operating systems errors
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Software Fault Tolerance: A Tutorial
Software Fault Tolerance: A Tutorial
Improving the reliability of commodity operating systems
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
IEEE Transactions on Software Engineering
An Architectural Framework for Providing Reliability and Security Support
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Recovering Internet Service Sessions from Operating System Failures
IEEE Internet Computing
An OS-level Framework for Providing Application-Aware Reliability
PRDC '06 Proceedings of the 12th Pacific Rim International Symposium on Dependable Computing
QEMU, a fast and portable dynamic translator
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
SafeDrive: safe and recoverable extensions using language-based techniques
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Exception handling in the choices operating system
Advanced Topics in Exception Handling Techniques
Improving dependability by revisiting operating system design
HotDep'07 Proceedings of the 3rd workshop on on Hot Topics in System Dependability
BootJacker: compromising computers using forced restarts
Proceedings of the 15th ACM conference on Computer and communications security
CuriOS: improving reliability through operating system structure
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Forenscope: a framework for live forensics
Proceedings of the 26th Annual Computer Security Applications Conference
OS-level hang detection in complex software systems
International Journal of Critical Computer-Based Systems
Hi-index | 0.00 |
Operating system lockup errors can render a computer unusable by preventing the execution other programs. Watchdog timers can be used to recover from a lockup by resetting the processor and rebooting the system when a lockup is detected. This results in a loss of unsaved data in running programs. Based on the observation that volatile memory is not affected when a processor a reset occurs, we present an approach to recover from a watchdog reset with minimal or zero loss of application state. We study the resolution of lockup conditions using thread termination and using exception dispatch. Thread termination can still result in a usable system and is already used as a recovery strategy for other errors in Linux. Using exceptions allows developers to write code to handle a lockup within the erroneous thread and attempt application transparent recovery. Fault injection experiments show that a significant percentage of lockups can be recovered by thread termination. Exception handling further improves the recoverability of the operating system.