Practical online failure prediction for Blue Gene/P: Period-based vs event-driven

Authors:
Li Yu; Ziming Zheng; Zhiling Lan; Susan Coghlan
Affiliations:
Department of Computer Science, Illinois Institute of Technology; Department of Computer Science, Illinois Institute of Technology; Department of Computer Science, Illinois Institute of Technology; Leadership Computing Facility, Argonne National Laboratory
Venue:
DSNW '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops
Year:
2011

Abstract

To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, online failure prediction is of paramount importance. While many techniques have been presented for online failure prediction, questions arise regarding two commonly used approaches: period-based and event-driven. Which one has better accuracy? What is the best observation window (i.e., the time interval used to collect evidence before making a prediction)? How does the lead time (i.e., the time interval from the prediction to the failure occurrence) impact prediction accuracy? To answer these questions, we analyze and compare period-based and event-driven prediction approaches via a Bayesian prediction model. We evaluate these prediction approaches, under a variety of testing parameters, by means of RAS logs collected from a production supercomputer at Argonne National Laboratory. Experimental results show that the period-based Bayesian model and the event-driven Bayesian model can achieve up to 65.0% and 83.8% prediction accuracy, respectively. Furthermore, our sensitivity study indicates that the event-driven approach seems more suitable for proactive fault management in large-scale systems like Blue Gene/P.
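
The terms used in the abstract can be made concrete with a small sketch. The Python below is not the authors' implementation and leaves out the Bayesian model entirely; it only illustrates, under stated assumptions, how the two approaches might differ in when a prediction is triggered, and how an observation window and a lead time could be applied at each trigger. All function names, the `horizon` parameter (the length of the interval in which a predicted failure is counted), the half-open window convention, and the toy timestamps are hypothetical and do not come from the paper.

```python
from bisect import bisect_left, bisect_right

def period_based_triggers(start, end, period):
    """Prediction times at fixed intervals, independent of event arrivals."""
    triggers = []
    t = start + period
    while t <= end:
        triggers.append(t)
        t += period
    return triggers

def event_driven_triggers(event_times):
    """Prediction times: one prediction per observed event."""
    return list(event_times)

def events_in_window(event_times, t, window):
    """Events in the observation window (t - window, t].

    event_times must be sorted in ascending order.
    """
    lo = bisect_right(event_times, t - window)
    hi = bisect_right(event_times, t)
    return event_times[lo:hi]

def failure_expected(failure_times, t, lead_time, horizon):
    """True if a failure falls in [t + lead_time, t + lead_time + horizon].

    `horizon` is an assumed parameter, not taken from the abstract.
    failure_times must be sorted in ascending order.
    """
    lo = bisect_left(failure_times, t + lead_time)
    hi = bisect_right(failure_times, t + lead_time + horizon)
    return hi > lo

# Toy data (hypothetical timestamps, arbitrary time units).
nonfatal = [10, 55, 130, 135, 140]   # non-fatal log events used as evidence
failures = [200]                     # fatal events to be predicted

# Evidence and ground-truth label at each trigger, for both approaches.
for name, triggers in [
    ("period-based", period_based_triggers(0, 300, 100)),
    ("event-driven", event_driven_triggers(nonfatal)),
]:
    for t in triggers:
        evidence = events_in_window(nonfatal, t, window=60)
        label = failure_expected(failures, t, lead_time=30, horizon=60)
        print(name, t, evidence, label)
```

In a real predictor, the evidence collected at each trigger would be fed to a model (Bayesian in the paper) and its output compared against the label. Here the loop only shows which evidence and which label each trigger would see, so the effect of changing `window`, `lead_time`, or `period` can be inspected directly.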