Current research and practice in proactive fault management

Authors:
Y. Li; Z. Lan
Affiliations:
Illinois Institute of Technology, Chicago, IL; Illinois Institute of Technology, Chicago, IL
Venue:
International Journal of Computers and Applications
Year:
2007

Abstract

Unlike rollback-recovery, proactive fault management takes preventive action before failures occur. In this survey, we classify current research on proactive fault management into two broad categories: failure analysis and prediction, and proactive techniques. Analytical methods have been widely used to analyse and forecast continuous values, while data mining and machine learning methods are mainly suited to categorical data. Various proactive fault management systems have been developed recently, each exploring different proactive techniques to achieve its specific design goals. Our investigation shows that further research is needed in the context of high-performance computing to enable efficient proactive fault management for emerging large-scale supercomputers.