Light-weight black-box failure detection for distributed systems

  • Authors:
  • Jiaqi Tan;Soila Kavulya;Rajeev Gandhi;Priya Narasimhan

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, USA;Carnegie Mellon University, Pittsburgh, USA;Carnegie Mellon University, Pittsburgh, USA;Carnegie Mellon University, Pittsburgh, USA

  • Venue:
  • Proceedings of the 2012 workshop on Management of big data systems
  • Year:
  • 2012

Abstract

Detecting failures in distributed systems is challenging because modern datacenters run a wide variety of applications. Current failure-detection techniques often require training, scale poorly, or produce results that are hard to interpret. We present LFD, a light-weight technique that quickly detects performance problems in distributed systems using only correlations of OS metrics. LFD is based on our hypothesis of server application behavior, requires no training, and detects failures with complexity linear in the number of nodes, producing results that sysadmins can interpret. We further show that LFD is versatile: it can diagnose faults in both Hadoop MapReduce systems and multi-tier web request systems, and its results are intuitive to sysadmins.
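
To make the idea of detection from correlations of OS metrics concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not say which metrics LFD correlates or how it decides that a node has failed. The metric pairing, the `threshold` value, and the names `pearson` and `flag_nodes` are all assumptions made for illustration. The sketch computes one correlation per node over a window of samples, so the work grows linearly with the number of nodes, and it needs no training phase.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 samples")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def flag_nodes(samples, threshold=0.5):
    """Return the sorted names of nodes whose two metric series are weakly correlated.

    samples: {node_name: (metric_a_series, metric_b_series)}
    threshold: hypothetical cutoff, not a value taken from the paper.
    One correlation is computed per node, so cost is linear in the node count.
    """
    flagged = []
    for node, (metric_a, metric_b) in samples.items():
        if pearson(metric_a, metric_b) < threshold:
            flagged.append(node)
    return sorted(flagged)


# Made-up sample windows: (metric A, metric B) for each node.
samples = {
    "node1": ([10, 20, 30, 40, 50], [1, 2, 3, 4, 5]),
    "node2": ([10, 20, 30, 40, 50], [12, 24, 33, 41, 55]),
    "node3": ([10, 20, 30, 40, 50], [5, 1, 4, 1, 5]),
}
print(flag_nodes(samples))
```

With this made-up data, only `node3` is flagged, because its two series no longer move together. The per-node output (a node name and a correlation value) is the kind of result a sysadmin can read directly, which is the interpretability property the abstract emphasizes.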