Online black-box failure prediction for mission critical distributed systems
SAFECOMP'12 Proceedings of the 31st international conference on Computer Safety, Reliability, and Security
Checkpointing algorithms and fault prediction
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, online failure prediction is of paramount importance. While many techniques have been presented for online failure prediction, questions arise regarding two commonly used approaches: period-based and event-driven. Which one has better accuracy? What is the best observation window (i.e., the time interval used to collect evidence before making a prediction)? How does the lead time (i.e., the time interval from the prediction to the failure occurrence) impact prediction arruracy? To answer these questions, we analyze and compare period-based and event-driven prediction approaches via a Bayesian prediction model. We evaluate these prediction approaches, under a variety of testing parameters, by means of RAS logs collected from a production supercomputer at Argonne National Laboratory. Experimental results show that the period-based Bayesian model and the event-driven Bayesian model can achieve up to 65.0% and 83.8% prediction accuracy, respectively. Furthermore, our sensitivity study indicates that the event-driven approach seems more suitable for proactive fault management in large-scale systems like Blue Gene/P.