Failure recovery: when the cure is worse than the disease

Authors:
Zhenyu Guo;Sean McDirmid;Mao Yang;Li Zhuang;Pu Zhang;Yingwei Luo;Tom Bergan;Peter Bodik;Madan Musuvathi;Zheng Zhang;Lidong Zhou
Affiliations:
Microsoft Research;Microsoft Research;Microsoft Research;Microsoft Research;Microsoft Research and Peking University;Peking University;Microsoft Research and University of Washington;Microsoft Research;Microsoft Research;Microsoft Research;Microsoft Research
Venue:
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Year:
2013

Abstract

Cloud services inevitably fail: machines lose power, networks become disconnected, pesky software bugs cause sporadic crashes, and so on. Unfortunately, failure recovery itself is often faulty; for example, recovery can accidentally replicate small failures to other machines, recursively, until the entire cloud service fails in a catastrophic outage, amplifying a small cold into a contagious, deadly plague! We propose that failure recovery should be engineered foremost according to the maxim of primum non nocere: first, do no harm. Accordingly, we must consider the system holistically when a failure occurs and recover only when observed activity safely allows for it.
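
As a purely illustrative reading of "recover only when observed activity safely allows for it," the sketch below gates recovery on a system-wide view rather than on a single machine's failure. It is not the authors' mechanism: the `ClusterView` fields, the `may_recover` function, and both thresholds are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ClusterView:
    """Hypothetical snapshot of what the recovery logic can observe."""
    total_machines: int
    failed_machines: int
    recoveries_in_flight: int


def may_recover(view: ClusterView,
                max_failed_fraction: float = 0.1,
                max_in_flight: int = 2) -> bool:
    """Return True only if starting one more recovery looks safe.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if view.total_machines <= 0:
        return False
    failed_fraction = view.failed_machines / view.total_machines
    # Many simultaneous failures suggest a systemic cause; recovering
    # (for example, re-replicating data) could spread the problem further.
    if failed_fraction > max_failed_fraction:
        return False
    # Cap concurrent recoveries so that recovery itself cannot snowball.
    if view.recoveries_in_flight >= max_in_flight:
        return False
    return True
```

Under these assumed thresholds, a single failure in a 100-machine cluster is allowed to recover, while 30 simultaneous failures cause recovery to be withheld until the wider state is understood.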