Microreboot — A technique for cheap recovery

Authors:
George Candea;Shinichi Kawamoto;Yuichi Fujiki;Greg Friedman;Armando Fox
Affiliations:
Computer Systems Lab, Stanford University;Computer Systems Lab, Stanford University;Computer Systems Lab, Stanford University;Computer Systems Lab, Stanford University;Computer Systems Lab, Stanford University
Venue:
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Year:
2004

Abstract

A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we use separation of process recovery from data recovery to enable microrebooting - a fine-grain technique for surgically recovering faulty application components, without disturbing the rest of the application. We evaluate microrebooting in an Internet auction system running on an application server. Microreboots recover most of the same failures as full reboots, but do so an order of magnitude faster and result in an order of magnitude savings in lost work. This cheap form of recovery engenders a new approach to high availability: microreboots can be employed at the slightest hint of failure, prior to node failover in multi-node clusters, even when mistakes in failure detection are likely; failure and recovery can be masked from end users through transparent call-level retries; and systems can be rejuvenated by parts, without ever being shut down.
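
As a rough illustration of the idea in the abstract, the sketch below shows one way a container might microreboot a single faulty component and retry the failed call, while state kept outside the component survives. It is a minimal sketch, not the authors' implementation: the names `SessionStore`, `BidComponent`, `Container`, `microreboot`, and `call` are all hypothetical, the `corrupted` flag merely stands in for a transient in-memory fault, and the retry assumes the call is safe to repeat.

```python
class SessionStore:
    """Hypothetical state store kept outside components, so it survives microreboots."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class BidComponent:
    """Hypothetical auction component that keeps no durable state of its own."""

    def __init__(self, store):
        self.store = store
        self.corrupted = False  # stand-in for a transient in-memory fault

    def place_bid(self, item, amount):
        if self.corrupted:
            raise RuntimeError("component is in a bad state")
        if amount > self.store.get(item, 0):
            self.store.put(item, amount)
        return self.store.get(item)


class Container:
    """Hypothetical container that can restart one component at a time."""

    def __init__(self):
        self.store = SessionStore()
        self.factories = {}
        self.components = {}
        self.microreboots = 0

    def deploy(self, name, factory):
        self.factories[name] = factory
        self.components[name] = factory(self.store)

    def microreboot(self, name):
        # Process recovery only: discard the instance and build a fresh one.
        # Data recovery is not needed here because the store is untouched.
        self.components[name] = self.factories[name](self.store)
        self.microreboots += 1

    def call(self, name, method, *args, retries=1):
        # Call-level retry: on failure, microreboot the component and retry,
        # so the caller never observes the fault if the retry succeeds.
        for attempt in range(retries + 1):
            try:
                return getattr(self.components[name], method)(*args)
            except RuntimeError:
                if attempt == retries:
                    raise
                self.microreboot(name)
```

In this sketch only the failing component is rebuilt; other deployed components and the shared store are left alone, which is the property the abstract describes as recovering "without disturbing the rest of the application".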