Analysis and Modeling of Correlated Failures in Multicomputer Systems
IEEE Transactions on Computers - Special issue on fault-tolerant computing
The NetLogger Methodology for High Performance Distributed Systems Performance Analysis
HPDC '98 Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing
Critical event prediction for proactive management in large-scale computer clusters
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
An Introduction to Computing System Dependability
Proceedings of the 26th International Conference on Software Engineering
Failure Data Analysis of a Large-Scale Heterogeneous Server Environment
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Refereed Papers: Real-time Log File Analysis Using the Simple Event Correlator (SEC)
LISA '04 Proceedings of the 18th USENIX conference on System administration
Research issues in data stream association rule mining
ACM SIGMOD Record
BlueGene/L Failure Analysis and Prediction Models
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
What Supercomputers Say: A Study of Five System Logs
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
A Meta-Learning Failure Predictor for Blue Gene/L Systems
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management
SRDS '07 Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems
BorderPatrol: isolating events for black-box tracing
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Exploring event correlation for failure prediction in coalitions of clusters
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
An analysis of clustered failures on large supercomputing systems
Journal of Parallel and Distributed Computing
Error log processing for accurate failure prediction
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Hi-index | 0.00 |
It has been long recognized that failure events are correlated, not independent. Previous research efforts have shown the correlation analysis of system logs is helpful to resource allocation, job scheduling and proactive management. However, previous log analysis methods analyze the history logs offline. They fail to capture the dynamic change of system errors and failures. In this paper, we purpose an online log analysis approach to mine event correlations in system logs of large-scale cluster systems. Our contributions are three-fold: first, we analyze the event correlations of system logs of a 260-nodes production Hadoop cluster system, and the result shows that the correlation rules of logs change dramatically in different periods; Second, we present a online log analysis algorithm Apriori-SO; third, based on the online event correlations mining, we present an online event prediction method that can predict diversities of failure events with the great detail. The experiment result of a 260-nodes production Hadoop cluster system shows that our online log analysis algorithm can analyze the log streams to obtain event correlation rules in soft real time, and our online event prediction method can achieve higher precision rate and recall rate than the offline log analysis approach.