Online event correlations analysis in system logs of large-scale cluster systems

  • Authors:
  • Wei Zhou;Jianfeng Zhan;Dan Meng;Zhihong Zhang

  • Affiliations:
  • Institute of Computing Technology, Chinese Academy of Sciences;Institute of Computing Technology, Chinese Academy of Sciences;Institute of Computing Technology, Chinese Academy of Sciences;The Research Institution of China Mobile

  • Venue:
  • NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
  • Year:
  • 2010

Abstract

It has long been recognized that failure events are correlated rather than independent. Previous research has shown that correlation analysis of system logs is helpful for resource allocation, job scheduling, and proactive management. However, previous log analysis methods analyze historical logs offline and therefore fail to capture dynamic changes in system errors and failures. In this paper, we propose an online log analysis approach to mine event correlations in the system logs of large-scale cluster systems. Our contributions are three-fold: first, we analyze the event correlations in the system logs of a 260-node production Hadoop cluster system, and the results show that the correlation rules of the logs change dramatically across different periods; second, we present an online log analysis algorithm, Apriori-SO; third, based on online event correlation mining, we present an online event prediction method that can predict a diversity of failure events in great detail. Experimental results on a 260-node production Hadoop cluster system show that our online log analysis algorithm can analyze log streams to obtain event correlation rules in soft real time, and that our online event prediction method achieves higher precision and recall than the offline log analysis approach.
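
To make the idea of mining event correlation rules from a log stream concrete, the following is a minimal sketch only. It is not the Apriori-SO algorithm, whose details are not given in this abstract. It assumes a simplified setting: each log event is a (timestamp, event_id) pair, a rule "a -> b" is counted when b occurs within a fixed time window after an occurrence of a, and rules are kept when they meet support and confidence thresholds. The class name `OnlineRuleMiner`, its parameters, and the event names are all hypothetical.

```python
from collections import Counter, deque


class OnlineRuleMiner:
    """Toy online miner for pairwise rules "a -> b" over a log stream.

    Illustrative assumption, not the paper's Apriori-SO algorithm.
    """

    def __init__(self, window, min_support, min_confidence):
        self.window = window                  # max time gap between a and b
        self.min_support = min_support        # min count of "a followed by b"
        self.min_confidence = min_confidence  # min count(a -> b) / count(a)
        self.recent = deque()                 # (timestamp, event_id, followers seen)
        self.event_counts = Counter()         # occurrences of each event type
        self.pair_counts = Counter()          # occurrences of a followed by b

    def observe(self, timestamp, event_id):
        # Evict events that have fallen out of the time window.
        while self.recent and timestamp - self.recent[0][0] > self.window:
            self.recent.popleft()
        # Credit each earlier occurrence at most once per follower type,
        # so that confidence can never exceed 1.
        for _, prior, followers in self.recent:
            if prior != event_id and event_id not in followers:
                followers.add(event_id)
                self.pair_counts[(prior, event_id)] += 1
        self.event_counts[event_id] += 1
        self.recent.append((timestamp, event_id, set()))

    def rules(self):
        # Return the rules that currently meet both thresholds.
        result = {}
        for (a, b), n in self.pair_counts.items():
            confidence = n / self.event_counts[a]
            if n >= self.min_support and confidence >= self.min_confidence:
                result[(a, b)] = confidence
        return result


# Hypothetical stream: timestamps in seconds, made-up event identifiers.
miner = OnlineRuleMiner(window=5, min_support=2, min_confidence=0.6)
for ts, ev in [(0, "disk_err"), (1, "io_fail"), (10, "disk_err"),
               (11, "io_fail"), (20, "disk_err"), (30, "net_down")]:
    miner.observe(ts, ev)
print(miner.rules())
```

Because the counts are updated as each event arrives, the rule set can be queried at any point in the stream. A real online miner would also need to age out old counts so that rules can track the period-to-period changes the abstract describes; this sketch omits that step.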