A proactive fault-detection mechanism in large-scale cluster systems

Authors:
Wu Linping;Meng Dan;Gao Wen;Zhan Jianfeng
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate School of the Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Abstract

To improve the overall dependability of large-scale cluster systems, this paper proposes an online fault-detection mechanism. The mechanism detects a fault before the affected node fails, which enables proactive fault management. It works in two steps. First, a model of the dynamic characteristics of the cluster system under normal activity is built using time series analysis methods. Second, faults are detected by comparing the current running state of the cluster system with this normal running model; a fault alarm is raised immediately when the current state deviates from the model. Experimental results show that the mechanism detects faults in the cluster system in good time.
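
The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say which time series model, which monitored metrics, or which deviation test the authors use, so everything specific here is an assumption made for illustration: a first-order autoregressive model AR(1) fitted by least squares to a single metric, and an alarm raised when the one-step prediction error exceeds k standard deviations of the training residuals. The names fit_normal_model, detect_fault, and k are hypothetical and do not come from the paper.

```python
import statistics


def fit_normal_model(samples):
    """Step 1 (assumed form): fit x[t] = intercept + phi * x[t-1]
    by least squares on samples recorded during normal activity."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to fit the model")
    prev, curr = samples[:-1], samples[1:]
    mean_prev = statistics.fmean(prev)
    mean_curr = statistics.fmean(curr)
    var_prev = sum((p - mean_prev) ** 2 for p in prev)
    cov = sum((p - mean_prev) * (x - mean_curr) for p, x in zip(prev, curr))
    # A constant series has zero variance; fall back to predicting the mean.
    phi = cov / var_prev if var_prev else 0.0
    intercept = mean_curr - phi * mean_prev
    residuals = [x - (intercept + phi * p) for p, x in zip(prev, curr)]
    sigma = statistics.pstdev(residuals)
    return intercept, phi, sigma


def detect_fault(model, previous, current, k=3.0):
    """Step 2 (assumed rule): alarm when the current value deviates from
    the model's one-step prediction by more than k * sigma."""
    intercept, phi, sigma = model
    predicted = intercept + phi * previous
    return abs(current - predicted) > k * sigma


# Hypothetical CPU-load readings (percent) from a healthy node.
normal_load = [50, 52, 49, 51, 50, 48, 51, 50, 49, 52, 50, 51]
model = fit_normal_model(normal_load)
print(detect_fault(model, previous=50, current=51))  # small deviation
print(detect_fault(model, previous=50, current=95))  # large deviation
```

In a real cluster the model would more likely cover several metrics per node and be refitted periodically; the threshold k trades off early detection against false alarms.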