A statistical approach to predictive detection
Computer Networks: The International Journal of Computer and Telecommunications Networking - Special issue on selected topics in network and systems management
Mining needle in a haystack: classifying rare classes via two-phase rule induction
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Bayesian approaches to failure prediction for disk drives
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
A comparative analysis of event tupling schemes
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Predicting Rare Events In Temporal Domains
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems
ISSRE '99 Proceedings of the 10th International Symposium on Software Reliability Engineering
Critical event prediction for proactive management in large-scale computer clusters
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining with rarity: a unifying framework
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery
IEEE Transactions on Dependable and Secure Computing
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Failure Diagnosis Using Decision Trees
ICAC '04 Proceedings of the First International Conference on Autonomic Computing
ICAC '05 Proceedings of the Second International Conference on Automatic Computing
Mining Logs Files for Computing System Management
ICAC '05 Proceedings of the Second International Conference on Automatic Computing
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
BlueGene/L Failure Analysis and Prediction Models
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
What Supercomputers Say: A Study of Five System Logs
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
A Meta-Learning Failure Predictor for Blue Gene/L Systems
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Exploring event correlation for failure prediction in coalitions of clusters
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Failure Prediction in IBM BlueGene/L Event Logs
ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Adaptive Fault Management of Parallel Applications for High-Performance Computing
IEEE Transactions on Computers
Fault-Aware Runtime Strategies for High-Performance Computing
IEEE Transactions on Parallel and Distributed Systems
Overview of the Blue Gene/L system architecture
IBM Journal of Research and Development
Cooperative checkpointing theory
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Performance implications of failures in large-scale cluster scheduling
JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
ACR: automatic checkpoint/restart for soft and hard error protection
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Failure prediction for HPC systems and applications: Current situation and open issues
International Journal of High Performance Computing Applications
Hi-index | 0.00 |
Despite years of study on failure prediction, it remains an open problem, especially in large-scale systems composed of vast amount of components. In this paper, we present a dynamic meta-learning framework for failure prediction. It intends to not only provide reasonable prediction accuracy, but also be of practical use in realistic environments. Two key techniques are developed to address technical challenges of failure prediction. One is meta-learning to boost prediction accuracy by combining the benefits of multiple predictive techniques. The other is a dynamic approach to dynamically obtain failure patterns from a changing training set and to dynamically extract effective rules by actively monitoring prediction accuracy at runtime. We demonstrate the effectiveness and practical use of this framework by means of real system logs collected from the production Blue Gene/L systems at Argonne National Laboratory and San Diego Supercomputer Center. Our case studies indicate that the proposed mechanism can provide reasonable prediction accuracy by forecasting up to 82% of the failures, with a runtime overhead less than 1.0 min.