Cheap recovery: a key to self-managing state

  • Authors:
  • Andrew C. Huang;Armando Fox

  • Affiliations:
  • Stanford University, Stanford, CA;Stanford University, Stanford, CA

  • Venue:
  • ACM Transactions on Storage (TOS)
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

Cluster hash tables (CHTs) are key components of many large-scale Internet services due to their highly-scalable performance and the prevalence of the type of data they store. Another advantage of CHTs is that they can be designed to be as self-managing as a cluster of stateless servers. One key to achieving this extreme manageability is reboot-based recovery that is predictably fast and has modest impact on system performance and availability. This "cheap" recovery mechanism simplifies management in two ways. First, it simplifies failure detection by lowering the cost of acting on false positives. This enables one to use statistical techniques to turn hard-to-catch failures, such as node degradation, into failure, followed by recovery. Second, cheap recovery simplifies capacity planning by recasting repartitioning as failure plus recovery to achieve zero-downtime incremental scaling. These low-cost recovery and scaling mechanisms make it possible for the system to be continuously self-adjusting, a key property of self-managing systems.