A query language for understanding component interactions in production systems
Proceedings of the 24th ACM International Conference on Supercomputing
End-to-end framework for fault management for open source clusters: Ranger
Proceedings of the 2010 TeraGrid Conference
Symptom-based problem determination using log data abstraction
Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
Proceedings of the 2011 ACM Symposium on Applied Computing
Modeling and tolerating heterogeneous failures in large parallel systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
LogSig: generating system events from raw textual logs
Proceedings of the 20th ACM international conference on Information and knowledge management
Provenance for system troubleshooting
LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
Spatio-temporal decomposition, clustering and identification for alert detection in system logs
Proceedings of the 27th Annual ACM Symposium on Applied Computing
Searching similar segments over textual event sequences
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Hi-index | 0.00 |
We present Nodeinfo, an unsupervised algorithm for anomaly detection in system logs. We demonstrate Nodeinfo's effectiveness on data from four of the world's most powerful supercomputers: using logs representing over 746 million processor-hours, in which anomalous events called alerts were manually tagged for scoring, we aim to automatically identify the regions of the log containing those alerts. We formalize the alert detection task in these terms, describe how Nodeinfo uses the information entropy of message terms to identify alerts, and present an online version of this algorithm, which is now in production use. This is the first work to investigate alert detection on (several) publicly-available supercomputer system logs, thereby providing a reproducible performance baseline.