Mitigating the impact of computer failure is possible if accurate failure predictions are provided. Resources, applications, and services can be scheduled around predicted failures to limit their impact. Such strategies are especially important for multi-computer systems, such as compute clusters, that experience a higher failure rate due to their large number of components. However, providing accurate predictions with sufficient lead time remains a challenging problem. This paper describes a new spectrum-kernel Support Vector Machine (SVM) approach to predict failure events based on system log files. These files contain messages that represent a change of system state. While a single message in the file may not be sufficient for predicting failure, a sequence or pattern of messages may be. The approach described in this paper uses a sliding window (sub-sequence) of messages to predict the likelihood of failure. A frequency representation of the observed message sub-sequences is then used as input to the SVM. The SVM then associates the messages with a class of failed or non-failed system. Experimental results using actual system log files from a Linux-based compute cluster indicate that the proposed spectrum-kernel SVM approach has promise and can predict hard disk failure with an accuracy of 73% two days in advance.
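The sliding-window and frequency-representation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the message-type IDs, the window width, the k-gram length `k`, and all function names are assumptions chosen for illustration. The sketch slices a message stream into windows, counts contiguous k-grams in each window (a k-spectrum), and computes a spectrum kernel as the inner product of those counts.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def sliding_windows(messages: Sequence[str], width: int) -> List[Tuple[str, ...]]:
    """Return every contiguous sub-sequence of `width` messages."""
    return [tuple(messages[i:i + width]) for i in range(len(messages) - width + 1)]


def spectrum(window: Sequence[str], k: int) -> Counter:
    """Count each contiguous k-gram of message IDs in one window."""
    return Counter(tuple(window[i:i + k]) for i in range(len(window) - k + 1))


def spectrum_kernel(a: Sequence[str], b: Sequence[str], k: int) -> int:
    """Inner product of the k-gram count vectors of two windows."""
    counts_a, counts_b = spectrum(a, k), spectrum(b, k)
    # Counter returns 0 for k-grams that do not occur in the other window.
    return sum(n * counts_b[kgram] for kgram, n in counts_a.items())


# Hypothetical message-type IDs standing in for parsed syslog messages.
log = ["A", "B", "A", "B", "C", "A"]
windows = sliding_windows(log, width=4)

# Kernel (Gram) matrix over all windows, using 2-grams.
gram_matrix = [[spectrum_kernel(x, y, k=2) for y in windows] for x in windows]
```

A matrix like `gram_matrix` could then be handed to an SVM that accepts a precomputed kernel (for example, scikit-learn's `SVC(kernel="precomputed")`), with each window labeled as failed or non-failed. How the labels and lead time are assigned is not specified here and would follow the paper's experimental setup.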