Otherworld: giving applications a chance to survive OS kernel crashes

Authors:
Alex Depoutovitch; Michael Stumm
Affiliations:
University of Toronto, Toronto, ON, Canada (both authors)
Venue:
Proceedings of the 5th European conference on Computer systems
Year:
2010

Abstract

The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications with an attendant loss of "work in progress" that is nonpersistent. Otherworld is a mechanism that microreboots the operating system kernel when a critical error is encountered in the kernel, and it does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure. It does so by restoring the application memory spaces, open files and other resources. In the default case it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can have user-level recovery procedures registered with the kernel, in which case Otherworld passes control to these procedures after having restored their process state. Recovery procedures might check the integrity of application data and restore resources Otherworld was not able to restore. We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld is capable of successfully microrebooting the kernel and restoring the applications in over 97% of the cases. In the default case, Otherworld adds zero overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection with overhead of between 4% and 12%.
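
The control flow described in the abstract (resume by default, or hand control to a registered recovery procedure after process state is restored) can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: it does not reproduce Otherworld's kernel interface, and every name in it (`ProcessState`, `register_recovery_procedure`, `resurrect`, `editor_recovery`, the result strings) is hypothetical. In particular, what happens when a recovery procedure reports a problem is not specified in the abstract, so the `"needs_attention"` outcome is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ProcessState:
    """Hypothetical stand-in for the state restored after a kernel microreboot."""
    pid: int
    memory: Dict[str, bytes]          # restored application memory (toy model)
    open_files: List[str]             # resources that were restored
    unrestored: List[str] = field(default_factory=list)  # resources that were not

# A recovery procedure inspects the restored state and returns True if the
# application can safely continue (assumed convention, not from the paper).
RecoveryProcedure = Callable[[ProcessState], bool]

# Hypothetical registry standing in for "registered with the kernel".
_recovery_procedures: Dict[int, RecoveryProcedure] = {}

def register_recovery_procedure(pid: int, procedure: RecoveryProcedure) -> None:
    """Record an application's recovery procedure ahead of any failure."""
    _recovery_procedures[pid] = procedure

def resurrect(state: ProcessState) -> str:
    """Decide how a restored process continues after the microreboot."""
    procedure = _recovery_procedures.get(state.pid)
    if procedure is None:
        # Default case: continue from the point of interruption.
        return "resumed"
    # Optional case: pass control to the application's own recovery procedure.
    return "recovered" if procedure(state) else "needs_attention"

def editor_recovery(state: ProcessState) -> bool:
    """Example procedure: check data integrity, then re-create lost resources."""
    buffer_ok = len(state.memory.get("buffer", b"")) > 0
    state.open_files.extend(state.unrestored)   # toy "reopen" of lost resources
    state.unrestored.clear()
    return buffer_ok
```

A process with no registered procedure simply resumes, while one that registered `editor_recovery` gets a chance to validate its buffer and reopen whatever could not be restored before it continues.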