Experience mining Google's production console logs
SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
We describe a novel application of data mining and statistical learning methods to automatically monitor console logs and detect abnormal execution traces in an online setting. Unlike existing solutions, we use a two-stage detection system. The first stage uses frequent pattern mining and distribution estimation techniques to capture the dominant patterns (both frequent sequences and time durations). The second stage uses a principal component analysis based anomaly detection technique to identify actual problems. Using real system data from a 203-node Hadoop [1] cluster, we show that we can not only achieve highly accurate and fast problem detection, but also help operators better understand execution patterns in their system.
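To make the second stage more concrete, the following is a minimal sketch of PCA-based anomaly detection using the squared prediction error in the residual subspace. It is an illustration under stated assumptions, not the detector described in the paper: the feature vectors (per-trace message counts), the function names `fit_pca_detector`, `spe_score`, and `is_anomalous`, the choice of `k`, the empirical-quantile threshold, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_pca_detector(X, k, quantile=0.995):
    """Fit a residual-subspace PCA detector on rows of X (one feature vector per trace).

    Assumption: the threshold is an empirical quantile of the training scores,
    chosen here for simplicity.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T  # (d, k) basis of the "normal" subspace
    residual = Xc - Xc @ P @ P.T
    spe = np.sum(residual ** 2, axis=1)  # squared prediction error per trace
    threshold = float(np.quantile(spe, quantile))
    return mean, P, threshold


def spe_score(x, mean, P):
    """Squared norm of the part of x that the normal subspace cannot explain."""
    xc = x - mean
    r = xc - P @ (P.T @ xc)
    return float(r @ r)


def is_anomalous(x, mean, P, threshold):
    return spe_score(x, mean, P) > threshold


# Synthetic "normal" traces (hypothetical): the second message type appears
# about twice as often as the first, and the third appears about once.
rng = np.random.default_rng(0)
a = rng.integers(1, 20, size=500).astype(float)
X_train = np.column_stack([a, 2 * a, np.ones(500)])
X_train += rng.normal(0.0, 0.1, size=X_train.shape)

mean, P, threshold = fit_pca_detector(X_train, k=1)

normal_trace = np.array([10.0, 20.0, 1.0])   # follows the learned correlation
broken_trace = np.array([10.0, 3.0, 1.0])    # violates it

print(is_anomalous(normal_trace, mean, P, threshold))
print(is_anomalous(broken_trace, mean, P, threshold))
```

The score is large only when a trace departs from the correlations that dominate normal executions, so a trace can have unusual absolute counts and still be scored as normal if it follows the usual proportions.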