Handling cascading failures: the case for topology-aware fault-tolerance

  • Authors:
  • Soila Pertet; Priya Narasimhan

  • Affiliations:
  • Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA (both authors)

  • Venue:
  • HotDep '05: Proceedings of the First Conference on Hot Topics in System Dependability
  • Year:
  • 2005

Abstract

Large distributed systems contain multiple components that can interact in sometimes unforeseen and complicated ways; this emergent "vulnerability of complexity" increases the likelihood of cascading failures that might result in widespread disruption. Our research explores whether we can exploit the knowledge of the system's topology, the application's interconnections, and the application's normal fault-free behavior to build proactive fault-tolerance techniques that could curb the spread of cascading failures and enable faster system-wide recovery. We seek to characterize what the topology knowledge would entail, quantify the benefits of our approach, and understand the associated trade-offs.