Cheap recovery: a key to self-managing state

Authors:
Andrew C. Huang; Armando Fox
Affiliations:
Stanford University, Stanford, CA; Stanford University, Stanford, CA
Venue:
ACM Transactions on Storage (TOS)
Year:
2005

Abstract

Cluster hash tables (CHTs) are key components of many large-scale Internet services because of their highly scalable performance and the prevalence of the type of data they store. Another advantage of CHTs is that they can be designed to be as self-managing as a cluster of stateless servers. One key to achieving this extreme manageability is reboot-based recovery that is predictably fast and has only a modest impact on system performance and availability. This "cheap" recovery mechanism simplifies management in two ways. First, it simplifies failure detection by lowering the cost of acting on false positives. This makes it practical to use statistical techniques to turn hard-to-catch failures, such as node degradation, into failure followed by recovery. Second, cheap recovery simplifies capacity planning by recasting repartitioning as failure plus recovery, which achieves zero-downtime incremental scaling. These low-cost recovery and scaling mechanisms make it possible for the system to be continuously self-adjusting, a key property of self-managing systems.
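
The idea of turning node degradation into failure followed by recovery can be illustrated with a minimal sketch. The abstract does not say which statistical technique the authors use, so the median-absolute-deviation test below is an assumption, as are the function name `nodes_to_reboot`, the per-node latency input, and the threshold value. The point is only that when a reboot is cheap, a detector can afford to flag nodes aggressively.

```python
from statistics import median


def nodes_to_reboot(latency_ms, threshold=3.0):
    """Return ids of nodes that look degraded relative to their peers.

    latency_ms: dict mapping a node id to its recent mean latency
                (hypothetical input, not taken from the paper).
    threshold:  how many median absolute deviations (MADs) above the
                cluster median a node must be before it is flagged
                (hypothetical value).
    """
    values = list(latency_ms.values())
    med = median(values)
    # MAD is a robust spread estimate, so one slow node does not
    # inflate the baseline it is compared against.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the nodes report identical latency; this
        # sketch treats that as "no usable spread" and flags nothing.
        return []
    # One-sided test: only nodes slower than their peers are flagged.
    return sorted(
        node for node, v in latency_ms.items()
        if (v - med) / mad > threshold
    )


if __name__ == "__main__":
    cluster = {"n1": 10, "n2": 11, "n3": 9, "n4": 10, "n5": 50}
    for node in nodes_to_reboot(cluster):
        # With cheap recovery, a false positive here costs only one
        # fast reboot, so no further diagnosis is attempted.
        print(f"reboot {node}")
```

A false positive in this scheme costs one fast reboot, which is why a low threshold can be tolerable. With expensive recovery, the same detector would need a far more conservative setting.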