Online event correlations analysis in system logs of large-scale cluster systems

Authors:
Wei Zhou;Jianfeng Zhan;Dan Meng;Zhihong Zhang
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences;Institute of Computing Technology, Chinese Academy of Sciences;Institute of Computing Technology, Chinese Academy of Sciences;The Research Institution of China Mobile
Venue:
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Year:
2010


Abstract

It has long been recognized that failure events are correlated, not independent. Previous research has shown that correlation analysis of system logs helps resource allocation, job scheduling, and proactive management. However, previous log analysis methods analyze historical logs offline and therefore fail to capture dynamic changes in system errors and failures. In this paper, we propose an online log analysis approach to mine event correlations in the system logs of large-scale cluster systems. Our contributions are three-fold: first, we analyze the event correlations in the system logs of a 260-node production Hadoop cluster system, and the results show that the correlation rules of the logs change dramatically across different periods; second, we present an online log analysis algorithm, Apriori-SO; third, based on the online event correlation mining, we present an online event prediction method that can predict diverse failure events in great detail. Experimental results on the 260-node production Hadoop cluster system show that our online log analysis algorithm can analyze log streams to obtain event correlation rules in soft real time, and that our online event prediction method achieves higher precision and recall than the offline log analysis approach.
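As a rough illustration of mining event correlation rules from a log stream, the sketch below counts how often one event type is followed by another within a sliding time window and keeps the pairs that meet support and confidence thresholds. It is not Apriori-SO, whose details the abstract does not give; the function name `mine_pair_rules`, the thresholds, the 60-second window, and the event names are all assumptions made for this example.

```python
from collections import Counter, deque


def mine_pair_rules(events, window=60.0, min_support=2, min_confidence=0.5):
    """Mine rules 'a is followed by b within `window` seconds'.

    events: iterable of (timestamp, event_type), sorted by timestamp.
    Returns {(a, b): confidence}. Illustrative only, not Apriori-SO.
    """
    recent = deque()        # (timestamp, type, followers already counted)
    occurrences = Counter() # how often each event type occurred
    followed = Counter()    # occurrences of a followed by b at least once

    for ts, etype in events:
        # Drop events that have fallen out of the time window.
        while recent and ts - recent[0][0] > window:
            recent.popleft()
        # Each earlier occurrence counts a given follower type at most once,
        # so confidence can never exceed 1.
        for _, prev_type, seen in recent:
            if prev_type != etype and etype not in seen:
                seen.add(etype)
                followed[(prev_type, etype)] += 1
        occurrences[etype] += 1
        recent.append((ts, etype, set()))

    rules = {}
    for (a, b), n in followed.items():
        confidence = n / occurrences[a]
        if n >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


if __name__ == "__main__":
    # Hypothetical log stream: (seconds, event type).
    stream = [
        (0, "disk_err"), (10, "io_fail"),
        (100, "disk_err"), (105, "io_fail"),
        (300, "disk_err"),
    ]
    print(mine_pair_rules(stream))
```

Because the counters are updated one event at a time, the rule set can be recomputed at any point in the stream, which is the property an online approach needs. A rule such as ("disk_err", "io_fail") could then serve as a simple predictor: when the first event appears, the second is expected within the window.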