Accurate fault detection is a key element of resilient computing. Syslogs provide key information regarding faults and are found on nearly all computing systems. Discovering new fault types still requires expert human effort, however, as no previous algorithm has been shown to localize faults in time and space with an operationally acceptable false positive rate. We present experiments on three weeks of syslogs from Sandia's 512-node "Spirit" Linux cluster, showing one algorithm that localizes 50% of faults with 75% precision, corresponding to an excellent false positive rate of 0.05%. The salient characteristics of this algorithm are (1) calculation of nodewise information entropy and (2) encoding of word position. The key observation is that similar computers correctly executing similar work should produce similar logs.
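To illustrate the two salient characteristics, the following is a minimal sketch, not the published algorithm: it scores each node by the total surprisal (information content) of its position-encoded log words across the cluster, on the assumption that terms shared by many nodes carry little information while terms unique to one node signal a possible fault. The function name `nodewise_entropy_scores` and the sample logs are hypothetical.

```python
import math
from collections import defaultdict

def nodewise_entropy_scores(node_logs):
    """Score each node by the information content of its log terms.

    node_logs: dict mapping node name -> list of (position, word)
    tokens, so a word's position in its message is part of the term.
    """
    # Count which nodes emit each (position, word) term.
    term_nodes = defaultdict(set)
    for node, tokens in node_logs.items():
        for term in tokens:
            term_nodes[term].add(node)

    n = len(node_logs)
    scores = {}
    for node, tokens in node_logs.items():
        # Surprisal -log2(p) of each distinct term, where p is the
        # fraction of nodes emitting it; similar nodes doing similar
        # work emit similar terms, so rare terms dominate the score.
        scores[node] = sum(
            -math.log2(len(term_nodes[t]) / n) for t in set(tokens)
        )
    return scores

# Hypothetical four-node example: one node logs an unusual word.
logs = {
    "n01": [(0, "kernel"), (1, "up")],
    "n02": [(0, "kernel"), (1, "up")],
    "n03": [(0, "kernel"), (1, "panic")],  # anomalous term
    "n04": [(0, "kernel"), (1, "up")],
}
scores = nodewise_entropy_scores(logs)
print(max(scores, key=scores.get))  # -> n03
```

Encoding the position alongside the word means that, for example, "panic" as the second word of a message and "panic" as the fifth are distinct terms, which sharpens the comparison between nodes running the same software.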