Failure Prediction in IBM BlueGene/L Event Logs

Authors:
Yinglung Liang; Yanyong Zhang; Hui Xiong; Ramendra Sahoo
Venue:
ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Year:
2007

Cited 14

An analysis of clustered failures on large supercomputing systems (Journal of Parallel and Distributed Computing)
Sustainable operation and management of data center chillers using temporal data mining (Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining)
A study of dynamic meta-learning for failure prediction in large-scale systems (Journal of Parallel and Distributed Computing)
Adaptive system anomaly prediction for large-scale hosting infrastructures (Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing)
End-to-end framework for fault management for open source clusters: Ranger (Proceedings of the 2010 TeraGrid Conference)
Predicting computer system failures using support vector machines (WASL'08 Proceedings of the First USENIX conference on Analysis of system logs)
Mining hot clusters of similar anomalies for system management (PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence)
Using syslog message sequences for predicting disk failures (LISA'10 Proceedings of the 24th international conference on Large installation system administration)
Bridging the gaps: joining information sources with Splunk (SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques)
Temporal data mining approaches for sustainable chiller management in data centers (ACM Transactions on Intelligent Systems and Technology (TIST))
Failure prediction based on log files using Random Indexing and Support Vector Machines (Journal of Systems and Software)
A comparison of machine learning algorithms for proactive hard disk drive failure detection (Proceedings of the 4th international ACM Sigsoft symposium on Architecting critical systems)
Failure analysis of distributed scientific workflows executing in the cloud (Proceedings of the 8th International Conference on Network and Service Management)
Checkpointing algorithms and fault prediction (Journal of Parallel and Distributed Computing)

Abstract

Frequent failures are becoming a serious concern to the community of high-end computing, especially when the applications and the underlying systems rapidly grow in size and complexity. In order to develop effective fault-tolerant strategies, there is a critical need to predict failure events. To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128K processors, and is currently the fastest supercomputer in the world. In this study, we first show how the event records can be converted into a data set that is appropriate for running classification techniques. Then we apply classifiers on the data, including RIPPER (a rule-based classifier), Support Vector Machines (SVMs), a traditional Nearest Neighbor method, and a customized Nearest Neighbor method. We show that the customized nearest neighbor approach can outperform RIPPER and SVMs in terms of both coverage and precision. The results suggest that the customized nearest neighbor approach can be used to alleviate the impact of failures.
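
The abstract does not give the details of the conversion step or of the customized nearest neighbor method, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes hypothetical event records of the form (timestamp in seconds, severity), fixed-size time windows, per-severity counts as features, and a plain 1-nearest-neighbor rule. It also assumes that "coverage" corresponds to recall, which the abstract does not state.

```python
from collections import Counter
from math import dist

# Hypothetical severity levels; real BlueGene/L logs may use different ones.
SEVERITIES = ("INFO", "WARNING", "FATAL")

def window_features(events, window, severities=SEVERITIES):
    """Bucket (timestamp, severity) events into fixed windows and
    return one tuple of per-severity counts for each window."""
    if not events:
        return []
    n_windows = max(t for t, _ in events) // window + 1
    counts = [Counter() for _ in range(n_windows)]
    for t, sev in events:
        counts[t // window][sev] += 1
    return [tuple(c[s] for s in severities) for c in counts]

def label_windows(features, fatal_index=2):
    """Label window i as 1 if window i+1 contains a FATAL event, else 0.
    The last window has no successor, so it gets no label."""
    return [1 if features[i + 1][fatal_index] > 0 else 0
            for i in range(len(features) - 1)]

def nearest_neighbor_predict(train_x, train_y, x):
    """Plain 1-nearest-neighbor: return the label of the closest
    training vector under Euclidean distance."""
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[best]

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (failure) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up toy events, for illustration only.
events = [(0, "INFO"), (5, "WARNING"), (12, "FATAL"), (25, "INFO")]
features = window_features(events, window=10)
labels = label_windows(features)
```

Here each window is labeled by whether a fatal event occurs in the following window, so a classifier trained on these pairs predicts one window ahead. The window length, the feature set, and the distance metric are all illustrative choices.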