Unlike rollback-recovery, proactive fault management takes preventive actions before failures occur. In this survey paper, we classify current research on proactive fault management into two broad categories: failure analysis and prediction, and proactive techniques. Analytical methods have been widely used to analyse and forecast continuous values, while data mining or machine learning methods are mostly suited to categorical data. Various proactive fault management systems have recently been developed, each exploring different proactive techniques to achieve its specific design goals. Our investigation shows that further research is needed in the context of high performance computing to enable efficient proactive fault management for emerging large-scale supercomputers.