The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications, and any nonpersistent "work in progress" is lost. Otherworld is a mechanism that microreboots the operating system kernel when a critical error is encountered in the kernel, and it does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure. It does so by restoring the application memory spaces, open files and other resources. In the default case it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can register user-level recovery procedures with the kernel, in which case Otherworld passes control to these procedures after having restored their process state. Recovery procedures might check the integrity of application data and restore resources Otherworld was not able to restore. We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld is capable of successfully microrebooting the kernel and restoring the applications in over 97% of the cases. In the default case, Otherworld adds zero overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection at an overhead of between 4% and 12%.
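The resurrection flow described above (restore process state, then either resume or hand control to a registered recovery procedure) can be illustrated with a toy user-space simulation. This is only a sketch under stated assumptions: `Process`, `resurrect`, and `check_integrity` are hypothetical names invented for illustration, not Otherworld's interface, and nothing here touches a real kernel.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Process:
    """Toy stand-in for the per-process state that survives the kernel failure."""
    name: str
    memory: dict        # stands in for the application's memory space
    open_files: list    # stands in for open files and other resources
    # Optional user-level recovery procedure (hypothetical signature).
    recovery_proc: Optional[Callable[..., bool]] = None
    log: list = field(default_factory=list)


def resurrect(proc: Process) -> str:
    """Simulate restoring one process after the kernel microreboot."""
    # Step 1: restore the memory space, open files, and other resources.
    proc.log.append("restored memory and open files")
    # Step 2: if a recovery procedure is registered, pass control to it.
    if proc.recovery_proc is not None:
        ok = proc.recovery_proc(proc)
        proc.log.append("recovery procedure " + ("succeeded" if ok else "failed"))
        return "recovered" if ok else "failed"
    # Default case: continue from the point of interruption.
    proc.log.append("resumed at interruption point")
    return "resumed"


def check_integrity(proc: Process) -> bool:
    """Hypothetical recovery procedure: verify application data before continuing."""
    return proc.memory.get("checksum") == sum(proc.memory.get("data", []))


editor = Process("editor", {"data": [1, 2, 3]}, ["notes.txt"])
database = Process("database", {"data": [4, 5], "checksum": 9},
                   ["table.db"], check_integrity)
outcomes = {p.name: resurrect(p) for p in (editor, database)}
```

In this sketch the process without a registered procedure simply resumes, while the one with a procedure gets a chance to validate its own data first, mirroring the two cases in the abstract.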