Improving availability with recursive microreboots: a soft-state system case study

  • Authors:
  • George Candea;James Cutler;Armando Fox

  • Affiliations:
  • Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA

  • Venue:
  • Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover.All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is recovered first; if that does not work, progressively larger subsets are recovered. Our domain of interest is Internet services; these systems experience primarily transient or intermittent failures, that can typically be resolved by rebooting. Conceding that failure-free software will continue eluding us for years to come, we undertake a systematic investigation of fine grain component-level restarts, microreboots, as high availability medicine. Building and maintaining an accurate model of large Internet systems is nearly impossible, due to their scale and constantly evolving nature, so we take an application-generic approach, that relies on empirical observations to manage recovery.We apply recursive microreboots to Mercury, a commercial off-the-shelf (COTS)-based satellite ground station that is based on an lnternet service platform. Mercury has been in successful operation for over 3 years. From our experience with Mercury, we draw design guidelines and lessons for the application of recursive microreboots to other software systems. We also present a set of guidelines for building systems amenable to recursive reboots, known as "crash-only software systems."