Light-weight black-box failure detection for distributed systems

Authors:
Jiaqi Tan;Soila Kavulya;Rajeev Gandhi;Priya Narasimhan
Affiliations:
Carnegie Mellon University, Pittsburgh, USA;Carnegie Mellon University, Pittsburgh, USA;Carnegie Mellon University, Pittsburgh, USA;Carnegie Mellon University, Pittsburgh, USA
Venue:
Proceedings of the 2012 workshop on Management of big data systems
Year:
2012

Abstract

Detecting failures in distributed systems is challenging because modern datacenters run a wide variety of applications. Current failure-detection techniques often require training, scale poorly, or produce results that are hard to interpret. We present LFD, a light-weight technique that quickly detects performance problems in distributed systems using only correlations of OS metrics. LFD is based on our hypothesis of server application behavior, requires no training, detects failures with complexity linear in the number of nodes, and produces results that sysadmins can interpret. We further show that LFD is versatile: it can diagnose faults both in Hadoop MapReduce systems and in multi-tier web request systems, and we show how LFD is intuitive to sysadmins.
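
To make the idea of training-free detection from correlations of OS metrics more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which metrics LFD correlates, how it windows them, or what threshold it uses, so the metric pairing, the function names `pearson` and `flag_suspect_nodes`, and the `threshold` value are all assumptions made for illustration. The sketch makes one pass over the nodes, which matches the linear-in-nodes cost the abstract mentions.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_suspect_nodes(
    samples: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Return nodes whose two metric series are weakly correlated.

    `samples` maps a node name to one window of two OS metric series.
    Which two metrics to pair is an assumption of this sketch.
    Each node is examined once, so the cost grows linearly with node count.
    """
    suspects = []
    for node, (metric_a, metric_b) in samples.items():
        if pearson(metric_a, metric_b) < threshold:
            suspects.append(node)
    return sorted(suspects)


# Hypothetical window of samples for two nodes.
window = {
    "node-1": ([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]),  # metrics move together
    "node-2": ([1, 2, 3, 4, 5], [5, 1, 4, 2, 3]),   # metrics decoupled
}
suspects = flag_suspect_nodes(window)
```

In this sketch a node is reported simply because its two metrics stopped moving together within the window, which is the kind of directly readable signal the abstract describes as interpretable by sysadmins. No model is fitted beforehand, so there is no training phase.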