Frequent failures are becoming a serious concern for the high-end computing community, especially as applications and the underlying systems rapidly grow in size and complexity. To develop effective fault-tolerant strategies, there is a critical need to predict failure events. To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128K processors and is currently the fastest supercomputer in the world. In this study, we first show how the event records can be converted into a data set that is appropriate for running classification techniques. We then apply several classifiers to the data: RIPPER (a rule-based classifier), Support Vector Machines (SVMs), a traditional Nearest Neighbor method, and a customized Nearest Neighbor method. We show that the customized nearest neighbor approach can outperform RIPPER and SVMs in terms of both coverage and precision. The results suggest that the customized nearest neighbor approach can be used to alleviate the impact of failures.
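The general idea of turning event records into a classification data set can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layout `(timestamp, severity)`, the severity names, the fixed window length, and the feature choice (counts of non-fatal events per window) are hypothetical, and the classifier is a plain 1-nearest-neighbor rule, not the customized nearest neighbor method the abstract refers to.

```python
from collections import Counter
from math import sqrt

# Hypothetical event records: (timestamp_seconds, severity).
# This layout is an illustrative assumption, not the BlueGene/L log schema.
EVENTS = [
    (5, "INFO"), (12, "WARNING"), (18, "WARNING"), (25, "FATAL"),
    (34, "INFO"), (41, "INFO"), (55, "INFO"),
    (62, "WARNING"), (68, "WARNING"), (71, "WARNING"), (75, "FATAL"),
    (95, "INFO"),
]
NONFATAL = ("INFO", "WARNING")  # severities used as features (assumed)


def build_dataset(events, window, horizon_end):
    """Build one (features, label) row per observation window.

    features: count of each non-fatal severity inside the window.
    label: 1 if a FATAL event falls in the window that follows, else 0.
    """
    rows = []
    start = 0
    while start + 2 * window <= horizon_end:
        mid = start + window
        counts = Counter(sev for t, sev in events if start <= t < mid)
        fatal_next = any(
            sev == "FATAL" and mid <= t < mid + window for t, sev in events
        )
        rows.append(([counts[s] for s in NONFATAL], int(fatal_next)))
        start += window
    return rows


def predict_1nn(train, x):
    """Return the label of the training row closest to x (Euclidean distance)."""
    def distance(row):
        return sqrt(sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return min(train, key=distance)[1]


rows = build_dataset(EVENTS, window=10, horizon_end=100)
print(predict_1nn(rows, [0, 3]))  # window with many warnings
print(predict_1nn(rows, [2, 0]))  # window with only informational events
```

In this toy data, windows dominated by warnings precede fatal events, so a query with several warnings lands next to a positive training row. Real logs would need richer features and a distance measure tuned to the data.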