Detecting large-scale system problems by mining console logs

Authors:
Wei Xu;Ling Huang;Armando Fox;David Patterson;Michael I. Jordan
Affiliations:
University of California at Berkeley, Berkeley, CA, USA;Intel Labs Berkeley, Berkeley, CA, USA;University of California at Berkeley, Berkeley, CA, USA;University of California at Berkeley, Berkeley, CA, USA;University of California at Berkeley, Berkeley, CA, USA
Venue:
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Year:
2009

Citing 22
Cited 59

Modern compiler implementation in Java

Modern compiler implementation in Java
Data mining: practical machine learning tools and techniques with Java implementations

Data mining: practical machine learning tools and techniques with Java implementations
Mining Partially Periodic Event Patterns with Unknown Periods

Proceedings of the 17th International Conference on Data Engineering
Diagnosing network-wide traffic anomalies

Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications
Automated System Monitoring and Notification With Swatch

LISA '93 Proceedings of the 7th USENIX conference on System administration
Lucene in Action (In Action series)

Lucene in Action (In Action series)
Why inverse document frequency?

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data

Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data
Dynamic syslog mining for network failure monitoring

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Towards informatic analysis of syslogs

CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Detecting similar Java classes using tree algorithms

Proceedings of the 2006 international workshop on Mining software repositories
YALE: rapid prototyping for complex data mining tasks

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Path-based failure and evolution management

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
What Supercomputers Say: A Study of Five System Logs

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
/*icomment: bugs or bad comments?*/

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
From dirt to shovels: fully automatic tool generation from ad hoc data

Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Introduction to Information Retrieval

Introduction to Information Retrieval
Understanding customer problem troubleshooting from storage system logs

FAST '09 Proceedings of the 7th conference on File and storage technologies
Clustering event logs using iterative partitioning

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Discovering actionable patterns in event data

IBM Systems Journal
SALSA: analyzing logs as state machines

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
X-trace: a pervasive network tracing framework

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation

Online aggregation and continuous query support in MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A query language for understanding component interactions in production systems

Proceedings of the 24th ACM International Conference on Supercomputing
Adaptive system anomaly prediction for large-scale hosting infrastructures

Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Detecting the performance impact of upgrades in large operational networks

Proceedings of the ACM SIGCOMM 2010 conference
California fault lines: understanding the causes and impact of network failures

Proceedings of the ACM SIGCOMM 2010 conference
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Mining invariants from console logs for system problem detection

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Experiences with tracing causality in networked services

INM/WREN'10 Proceedings of the 2010 internet network management conference on Research on enterprise networking
Performance analysis of idle programs

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
What happened in my network: mining network events from router syslogs

IMC '10 Proceedings of the 10th ACM SIGCOMM conference on Internet measurement
Symptom-based problem determination using log data abstraction

Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
Chukwa: a system for reliable large-scale log collection

LISA'10 Proceedings of the 24th international conference on Large installation system administration
Creating the knowledge about IT events

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Synoptic: summarizing system logs with refinement

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
A graphical representation for identifier structure in logs

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Experience mining Google's production console logs

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Bridging the gaps: joining information sources with Splunk

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Diagnosing performance changes by comparing request flows

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Storage and retrieval of system log events using a structured schema based on message type transformation

Proceedings of the 2011 ACM Symposium on Applied Computing
ASDF: an automated, online framework for diagnosing performance problems

Architecting dependable systems VII
Temporal data mining approaches for sustainable chiller management in data centers

ACM Transactions on Intelligent Systems and Technology (TIST)
Anomaly detection techniques for a web defacement monitoring service

Expert Systems with Applications: An International Journal
Otus: resource attribution in data-intensive clusters

Proceedings of the second international workshop on MapReduce and its applications
G2: a graph processing system for diagnosing distributed systems

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Leveraging existing instrumentation to automatically infer invariant-constrained models

Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering
P3CA: private anomaly detection across ISP networks

PETS'11 Proceedings of the 11th international conference on Privacy enhancing technologies
Mining temporal invariants from partially ordered logs

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Advances and challenges in log analysis

Communications of the ACM
Advances and Challenges in Log Analysis

Queue - Log Analysis
Mining temporal invariants from partially ordered logs

ACM SIGOPS Operating Systems Review
AutoLog: facing log redundancy and insufficiency

Proceedings of the Second Asia-Pacific Workshop on Systems
Improving Software Diagnosability via Log Enhancement

ACM Transactions on Computer Systems (TOCS) - Special Issue ASPLOS 2011
Precomputing possible configuration error diagnoses

ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
Provenance for system troubleshooting

LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
Structured comparative analysis of systems logs to diagnose performance problems

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Understanding and detecting real-world performance bugs

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
ADP: automated diagnosis of performance pathologies using hardware events

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Characterizing logging practices in open-source software

Proceedings of the 34th International Conference on Software Engineering
Bridging the divide between software developers and operators using logs

Proceedings of the 34th International Conference on Software Engineering
Light-weight black-box failure detection for distributed systems

Proceedings of the 2012 workshop on Management of big data systems
Be conservative: enhancing failure diagnosis with proactive logging

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Fmeter: extracting indexable low-level system signatures by counting kernel function calls

Proceedings of the 13th International Middleware Conference
Provenance from log files: a BigData problem

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Stream-based event prediction using bayesian and bloom filters

Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Juggling the Jigsaw: towards automated problem inference from network trouble tickets

NSDI'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Developing a predictive model of quality of experience for internet video

Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM
Assisting developers of big data analytics applications when deploying on hadoop clouds

Proceedings of the 2013 International Conference on Software Engineering
CAPRI: a tool for mining complex line patterns in large log data

Proceedings of the 2nd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
A comparison of syslog and IS-IS for network failure analysis

Proceedings of the 2013 conference on Internet measurement conference
Carat: collaborative energy diagnosis for mobile devices

Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems
Limplock: understanding the impact of limpware on scale-out cloud systems

Proceedings of the 4th annual Symposium on Cloud Computing
Aligning event logs and process models for multi-perspective conformance checking: an approach based on integer linear programming

BPM'13 Proceedings of the 11th international conference on Business Process Management
EnCore: exploiting system environment and correlation information for misconfiguration detection

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Operating Systems Review
Challenges to error diagnosis in hadoop ecosystems

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Workload-aware anomaly detection for Web applications

Journal of Systems and Software
An improved partitioning mechanism for optimizing massive data analysis using MapReduce

The Journal of Supercomputing
Structured and Interoperable Logging for the Cloud Computing Era: The Pitfalls and Benefits

UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing
NetCheck: network diagnoses from blackbox traces

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Abstract

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
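
The pipeline the abstract describes (parse messages, build composite features, flag unusual ones) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the message templates, the `req_N` identifiers, and the sample log are hypothetical, and the templates are written by hand, whereas the paper derives them from source code analysis. The abstract does not name the learning method, so the final step swaps in a simple rarity rule rather than the authors' technique.

```python
import re
from collections import Counter, defaultdict

# Hypothetical message templates (assumption). In the paper these would come
# from analyzing logging statements in the source code, not from hand-writing.
TEMPLATES = {
    "open": re.compile(r"\bopened (?P<id>req_\d+)"),
    "write": re.compile(r"\bwrote (?P<id>req_\d+)"),
    "close": re.compile(r"\bclosed (?P<id>req_\d+)"),
}
MESSAGE_TYPES = sorted(TEMPLATES)  # fixed column order: close, open, write


def parse_line(line):
    """Return (message_type, identifier), or None if no template matches."""
    for name, pattern in TEMPLATES.items():
        match = pattern.search(line)
        if match:
            return name, match.group("id")
    return None


def count_vectors(lines):
    """Group messages by identifier and count each message type per group."""
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            name, ident = parsed
            counts[ident][name] += 1
    return {ident: tuple(c[t] for t in MESSAGE_TYPES)
            for ident, c in counts.items()}


def flag_rare(vectors, rare_share=0.25):
    """Flag identifiers whose count vector is shared by few identifiers.

    A stand-in rule (assumption), not the learning method used in the paper.
    """
    freq = Counter(vectors.values())
    total = len(vectors)
    return sorted(ident for ident, vec in vectors.items()
                  if freq[vec] / total < rare_share)


# Hypothetical sample log: four requests complete, one is opened and never
# written or closed.
LOG = []
for i in range(1, 5):
    LOG += [f"node-a opened req_{i}",
            f"node-b wrote req_{i}",
            f"node-a closed req_{i}"]
LOG.append("node-b opened req_5")

suspects = flag_rare(count_vectors(LOG))
```

Grouping by identifier is what turns interleaved lines from many components into one feature vector per request, so a request that was opened but never closed stands out even though each of its individual messages looks normal.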