Bad Words: Finding Faults in Spirit's Syslogs

Authors:
Jon Stearley;Adam J. Oliner
Venue:
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Year:
2008

Citing 0
Cited 9

End-to-end framework for fault management for open source clusters: Ranger
Proceedings of the 2010 TeraGrid Conference

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing

Hunting for problems with Artemis
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs

Predicting computer system failures using support vector machines
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs

SALSA: analyzing logs as state machines
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs

Symptom-based problem determination using log data abstraction
Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research

Bridging the gaps: joining information sources with Splunk
SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques

3-Dimensional root cause diagnosis via co-analysis
Proceedings of the 9th international conference on Autonomic computing

Failure prediction based on log files using Random Indexing and Support Vector Machines
Journal of Systems and Software

Abstract

Accurate fault detection is a key element of resilient computing. Syslogs provide key information regarding faults, and are found on nearly all computing systems. Discovering new fault types requires expert human effort, however, as no previous algorithm has been shown to localize faults in time and space with an operationally acceptable false positive rate. We present experiments on three weeks of syslogs from Sandia's 512-node "Spirit" Linux cluster, showing one algorithm that localizes 50% of faults with 75% precision, corresponding to an excellent false positive rate of 0.05%. The salient characteristics of this algorithm are (1) calculation of nodewise information entropy, and (2) encoding of word position. The key observation is that similar computers correctly executing similar work should produce similar logs.
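
The two characteristics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the log-entropy weighting, the `"index:word"` term format, and the function names `position_terms`, `term_weights`, and `node_scores` are all assumptions made for illustration. The sketch also scores a single time window, whereas the paper localizes faults in both time and space.

```python
import math
from collections import Counter


def position_terms(message):
    """Encode each word with its position in the message, e.g. '2:error'."""
    return [f"{i}:{word}" for i, word in enumerate(message.split())]


def term_weights(counts_by_node):
    """Weight each term by 1 minus its normalized entropy across nodes.

    Assumed weighting: a term spread evenly over all nodes gets a weight
    near 0, while a term seen on a single node gets a weight of 1.
    """
    n_nodes = len(counts_by_node)
    totals = Counter()
    for counts in counts_by_node.values():
        totals.update(counts)

    weights = {}
    for term, total in totals.items():
        entropy = 0.0
        for counts in counts_by_node.values():
            count = counts.get(term, 0)
            if count:
                p = count / total
                entropy -= p * math.log2(p)
        if n_nodes > 1:
            weights[term] = 1.0 - entropy / math.log2(n_nodes)
        else:
            weights[term] = 1.0  # entropy is undefined for one node
    return weights


def node_scores(logs_by_node):
    """Score each node by how much unusual (high-weight) text it logged."""
    counts_by_node = {
        node: Counter(t for msg in messages for t in position_terms(msg))
        for node, messages in logs_by_node.items()
    }
    weights = term_weights(counts_by_node)
    return {
        node: math.sqrt(
            sum((weights[t] * math.log2(c + 1)) ** 2 for t, c in counts.items())
        )
        for node, counts in counts_by_node.items()
    }
```

Under these assumptions, nodes that log the same messages as their peers score near zero, and a node that logs something its peers do not stands out, which mirrors the observation that similar computers doing similar work should produce similar logs.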