Fingerpointing correlated failures in replicated systems

  • Authors:
  • Soila Pertet;Rajeev Gandhi;Priya Narasimhan

  • Affiliations:
  • Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA;Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA;Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA

  • Venue:
  • SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Replicated systems are often hosted over underlying group communication protocols that provide totally ordered, reliable delivery of messages. In the face of a performance problem at a single node, these protocols can cause correlated performance degradations at even non-faulty nodes, leading to potential red herrings in failure diagnosis. We propose a fingerpointing approach that combines node-level (local) anomaly detection, followed by system-wide (global) fingerpointing. The local anomaly detection relies on threshold-based analyses of system metrics, while global fingerpointing is based on the hypothesis that the root-cause of the failure is the node with an "odd-man-out" view of the anomalies. We compare the results of applying three classifiers - a heuristic algorithm, an unsupervised learner (k-means clustering), and a supervised learner (k-nearest-neighbor) - to finger-point the faulty node.