Improving availability with recursive microreboots: a soft-state system case study

Authors:
George Candea;James Cutler;Armando Fox
Affiliations:
Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA
Venue:
Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Year:
2004

Abstract

Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover.

All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is recovered first; if that does not work, progressively larger subsets are recovered. Our domain of interest is Internet services; these systems experience primarily transient or intermittent failures that can typically be resolved by rebooting. Conceding that failure-free software will continue to elude us for years to come, we undertake a systematic investigation of fine-grain component-level restarts, called microreboots, as high-availability medicine. Building and maintaining an accurate model of large Internet systems is nearly impossible, due to their scale and constantly evolving nature, so we take an application-generic approach that relies on empirical observations to manage recovery.

We apply recursive microreboots to Mercury, a commercial off-the-shelf (COTS)-based satellite ground station that is based on an Internet service platform. Mercury has been in successful operation for over 3 years. From our experience with Mercury, we draw design guidelines and lessons for the application of recursive microreboots to other software systems. We also present a set of guidelines for building systems amenable to recursive reboots, known as "crash-only software systems."
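
The recursive approach described in the abstract (recover a minimal subset of components first, then progressively larger subsets if that does not work) can be illustrated with a minimal sketch. This is not the authors' implementation: the tree layout, the component names, and the `restart` and `healthy` callbacks are all hypothetical, and the sketch assumes that components are grouped into a tree in which each node stands for the set of components beneath it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """A node in a hypothetical restart tree; leaves are individual components."""
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def __post_init__(self) -> None:
        # Link each child back to this node so recovery can escalate upward.
        for child in self.children:
            child.parent = self

    def leaves(self) -> List[str]:
        """Names of all components covered by this node."""
        if not self.children:
            return [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.leaves())
        return names


def recursive_recover(
    failed: Node,
    restart: Callable[[List[str]], None],
    healthy: Callable[[], bool],
) -> Optional[str]:
    """Restart the smallest subset first, then progressively larger ones.

    Returns the name of the node whose restart cured the failure,
    or None if even restarting the whole tree did not help.
    """
    node: Optional[Node] = failed
    while node is not None:
        restart(node.leaves())   # assumed hook: reboot these components
        if healthy():            # assumed hook: empirical health check
            return node.name
        node = node.parent       # escalate to the enclosing subset
    return None
```

In this sketch the decision to escalate depends only on the observed result of the health check after each restart, not on a model of the application's internals, which is one way to read the application-generic stance taken in the abstract.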