A Meta-Learning Failure Predictor for Blue Gene/L Systems

Authors:
Prashasta Gujrati;Yawei Li;Zhiling Lan;Rajeev Thakur;John White
Affiliations:
Illinois Institute of Technology, USA;Illinois Institute of Technology, USA;Illinois Institute of Technology, USA;Argonne National Laboratory, USA;San Diego Supercomputer Center, USA
Venue:
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Year:
2007

Abstract

The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components used in these systems are highly reliable, the presence of a large number of components inevitably increases the failure probability of such systems. Successful prediction of potential failures can greatly enhance the fault tolerance mechanisms used in large clusters, thereby mitigating the adverse impact of failures on system productivity and total cost of ownership. In this paper, we present a three-phase failure predictor that automatically processes RAS events and discovers failure patterns for prediction in Blue Gene/L systems. In particular, this paper explores the use of meta-learning to adaptively integrate base methods with the goal of boosting prediction accuracy. Experiments with two RAS logs collected from Blue Gene/L systems at ANL and SDSC demonstrate the effectiveness of the proposed failure predictor.
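
The abstract does not say which base methods are integrated or how the meta-learner combines them, so the following is only a minimal sketch of the general idea, not the paper's predictor. It assumes two hypothetical base rules (`recent_fatal_rule` and `warning_burst_rule`), a hypothetical event format of `(timestamp, severity)` pairs, and a simple accuracy-weighted vote as the integration step. All names and thresholds are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical event format: (timestamp in seconds, severity label).
Event = Tuple[int, str]
Window = Sequence[Event]
BasePredictor = Callable[[Window], bool]


def recent_fatal_rule(window: Window) -> bool:
    """Illustrative base method: predict a failure if the window holds a FATAL event."""
    return any(severity == "FATAL" for _, severity in window)


def warning_burst_rule(window: Window, threshold: int = 3) -> bool:
    """Illustrative base method: predict a failure after a burst of WARNING events."""
    warnings = sum(1 for _, severity in window if severity == "WARNING")
    return warnings >= threshold


def accuracy(predictor: BasePredictor, history: List[Tuple[Window, bool]]) -> float:
    """Fraction of labelled windows on which the predictor matched the outcome."""
    if not history:
        return 0.0
    correct = sum(1 for window, failed in history if predictor(window) == failed)
    return correct / len(history)


def fit_weights(predictors: List[BasePredictor],
                history: List[Tuple[Window, bool]]) -> List[float]:
    """Assumed meta-learning step: weight each base method by its past accuracy."""
    return [accuracy(p, history) for p in predictors]


def meta_predict(predictors: List[BasePredictor],
                 weights: List[float],
                 window: Window) -> bool:
    """Weighted vote: predict a failure if at least half the total weight agrees."""
    total = sum(weights)
    if total == 0:
        return False
    score = sum(w for p, w in zip(predictors, weights) if p(window))
    return score / total >= 0.5


# Toy labelled history: (window of events, whether a failure followed).
history = [
    ([(0, "FATAL")], True),
    ([(0, "WARNING"), (1, "WARNING"), (2, "WARNING")], True),
    ([(0, "INFO")], False),
    ([(0, "FATAL"), (1, "INFO")], True),
]
predictors = [recent_fatal_rule, warning_burst_rule]
weights = fit_weights(predictors, history)
```

With this toy history the fatal-event rule earns the larger weight, so `meta_predict(predictors, weights, [(10, "FATAL")])` follows it. Re-running `fit_weights` on a newer slice of the log is one simple way such a combiner could adapt over time.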