Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel

Authors:
George Candea; Armando Fox
Venue:
HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
Year:
2001

Cited by 40

ROC-1: Hardware Support for Recovery-Oriented Computing

IEEE Transactions on Computers - Special issue on fault-tolerant embedded systems
Improving cluster availability using workstation validation

SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Portability, Extensibility and Robustness in iROS

PERCOM '03 Proceedings of the First IEEE International Conference on Pervasive Computing and Communications
Improving the reliability of commodity operating systems

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Automatic detection and repair of errors in data structures

OOPSLA '03 Proceedings of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Acceptability-oriented computing

OOPSLA '03 Companion of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Acceptability-oriented computing

ACM SIGPLAN Notices
Improving availability with recursive microreboots: a soft-state system case study

Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Self-Healing in Modern Operating Systems

Queue - Programming Languages
Improving the reliability of commodity operating systems

ACM Transactions on Computer Systems (TOCS)
Destructive Transaction: Human-Oriented Cluster System Management Mechanism

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Data structure repair using goal-directed reasoning

Proceedings of the 27th international conference on Software engineering
An online evolutionary approach to developing internet services

EW 10 Proceedings of the 10th workshop on ACM SIGOPS European workshop
Inference and enforcement of data structure consistency specifications

Proceedings of the 2006 international symposium on Software testing and analysis
Conscientious software

Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications
Recovering device drivers

ACM Transactions on Computer Systems (TOCS)
Goal-Directed Reasoning for Specification-Based Data Structure Repair

IEEE Transactions on Software Engineering
Crash-only software

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Using runtime paths for macroanalysis

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Treating bugs as allergies: a safe method for surviving software failures

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Solaris service management facility: modern system startup and administration

LISA '05 Proceedings of the 19th conference on Large Installation System Administration Conference - Volume 19
Recovering device drivers

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Unmodified device driver reuse and improved system dependability via virtual machines

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Microreboot — A technique for cheap recovery

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Enhancing server availability and security through failure-oblivious computing

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Toward recovery-oriented computing

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Bristlecone: A Language for Robust Software Systems

ECOOP '08 Proceedings of the 22nd European conference on Object-Oriented Programming
Automatically patching errors in deployed software

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Three reliability engineering techniques and their application to evaluating the availability of IT systems: an introduction

IBM Systems Journal
"Otherworld": giving applications a chance to survive OS kernel crashes

HotDep'08 Proceedings of the Fourth conference on Hot topics in system dependability
Gestalt: integrated support for implementation and analysis in machine learning

UIST '10 Proceedings of the 23rd annual ACM symposium on User interface software and technology
Lowering the barrier to applying machine learning

UIST '10 Adjunct proceedings of the 23rd annual ACM symposium on User interface software and technology
Recovery tasks: an automated approach to failure recovery

RV'10 Proceedings of the First international conference on Runtime verification
Exception handling in the Choices operating system

Advanced Topics in Exception Handling Techniques
Modeling and cost analysis of nested software rejuvenation policy

ICNC'05 Proceedings of the First international conference on Advances in Natural Computation - Volume Part III
A dynamic mechanism for recovering from buffer overflow attacks

ISC'05 Proceedings of the 8th international conference on Information Security
Can dynamic provisioning and rejuvenation systems coexist in peace?

DSOM'05 Proceedings of the 16th IFIP/IEEE Ambient Networks international conference on Distributed Systems: Operations and Management
Towards service awareness and autonomic features in a SIP-Enabled network

WAC'05 Proceedings of the Second international IFIP conference on Autonomic Communication
Fault tolerance: case study

Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
Ideal stabilisation

International Journal of Grid and Utility Computing

Abstract

Even after decades of software engineering research, complex computer systems still fail, primarily due to nondeterministic bugs that are typically resolved by rebooting. Conceding that Heisenbugs will remain a fact of life, we propose a systematic investigation of restarts as "high availability medicine." In this paper we show how recursive restartability (RR), the ability of a system to gracefully tolerate restarts at multiple levels, improves fault tolerance, reduces time-to-repair, and enables system designers to build flexible, highly available software infrastructures. Using several examples of widely deployed software systems, we identify properties that are required of RR systems and outline an agenda for turning the recursive restartability philosophy into a practical software structuring tool. Finally, we describe infrastructural support for RR systems, along with initial ideas on how to analyze and benchmark such systems.