Online System Problem Detection by Mining Patterns of Console Logs

Authors:
Wei Xu;Ling Huang;Armando Fox;David Patterson;Michael Jordan
Affiliations:
-;-;-;-;-
Venue:
ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining
Year:
2009

Citing 0
Cited 7

Experience mining Google's production console logs

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Event log messages as a human interface, or, "do you pine for the days when men were men and wrote their own device drivers?"

Proceedings of the 22nd Conference of the Computer-Human Interaction Special Interest Group of Australia on Computer-Human Interaction
Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
PerfXplain: debugging MapReduce job performance

Proceedings of the VLDB Endowment
Selective resource characterization for evaluation of system dynamics

ACM SIGMETRICS Performance Evaluation Review
Fault prediction under the microscope: a closer look into HPC systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A framework to compute statistics of system parameters from very large trace files

ACM SIGOPS Operating Systems Review

Quantified Score

Hi-index	0.00

Visualization

Abstract

We describe a novel application of using data mining and statistical learning methods to automatically monitor and detect abnormal execution traces from console logs in an online setting. Different from existing solutions, we use a two stage detection system. The first stage uses frequent pattern mining and distribution estimation techniques to capture the dominant patterns (both frequent sequences and time duration). The second stage use principal component analysis based anomaly detection technique to identify actual problems. Using real system data from a 203-node Hadoop [1] cluster, we show that we can not only achieve highly accurate and fast problem detection, but also help operators better understand execution patterns in their system.