Handling cascading failures: the case for topology-aware fault-tolerance

Authors:
Soila Pertet;Priya Narasimhan
Affiliations:
Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA;Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA
Venue:
HotDep'05 Proceedings of the First conference on Hot topics in system dependability
Year:
2005

Abstract

Large distributed systems contain many components that can interact in complicated and sometimes unforeseen ways; this emergent "vulnerability of complexity" increases the likelihood of cascading failures that might result in widespread disruption. Our research explores whether we can exploit knowledge of the system's topology, the application's interconnections, and the application's normal fault-free behavior to build proactive fault-tolerance techniques that could curb the spread of cascading failures and enable faster system-wide recovery. We seek to characterize what this topology knowledge would entail, quantify the benefits of our approach, and understand the associated trade-offs.