Detecting failures in distributed systems is challenging because modern datacenters run a wide variety of applications. Current failure-detection techniques often require training, scale poorly, or produce results that are hard to interpret. We present LFD, a lightweight technique that quickly detects performance problems in distributed systems using only correlations of OS metrics. LFD is based on our hypothesis about server application behavior, requires no training, and detects failures with complexity linear in the number of nodes. We further show that LFD is versatile: it diagnoses faults in both Hadoop MapReduce systems and multi-tier web request systems, and its results are intuitive for sysadmins to interpret.
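
The abstract does not spell out LFD's algorithm, so the following is only a minimal sketch of the general idea of training-free detection from per-node correlations of OS metrics, not the authors' method. Everything specific here is an assumption made for illustration: the choice of two metrics per node, the use of Pearson correlation, the comparison against the cluster median, and the hypothetical names `pearson`, `flag_suspect_nodes`, and `threshold`.

```python
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation of two equal-length metric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation; treat it as 0.
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_suspect_nodes(samples, threshold=0.5):
    """Return nodes whose metric correlation deviates from the cluster median.

    samples: dict mapping node name -> (metric_a_series, metric_b_series).
    threshold: hypothetical cutoff on |correlation - median correlation|.
    """
    # One correlation per node: the per-node work is independent of other nodes.
    corr = {node: pearson(a, b) for node, (a, b) in samples.items()}
    typical = median(corr.values())
    return sorted(n for n, c in corr.items() if abs(c - typical) > threshold)
```

In this sketch no model is trained: each node is compared only against its peers in the same time window, and a flagged node comes with a single number (its correlation) that an operator can inspect. Note that `statistics.median` sorts its input, so this particular sketch is not strictly linear in the number of nodes even though it computes only one correlation per node.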