Practical online failure prediction for Blue Gene/P: Period-based vs event-driven

  • Authors:
  • Li Yu; Ziming Zheng; Zhiling Lan; Susan Coghlan

  • Affiliations:
  • Department of Computer Science, Illinois Institute of Technology; Department of Computer Science, Illinois Institute of Technology; Department of Computer Science, Illinois Institute of Technology; Leadership Computing Facility, Argonne National Laboratory

  • Venue:
  • DSNW '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops
  • Year:
  • 2011

Abstract

To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, online failure prediction is of paramount importance. While many techniques have been presented for online failure prediction, questions arise regarding two commonly used approaches: period-based and event-driven. Which one has better accuracy? What is the best observation window (i.e., the time interval used to collect evidence before making a prediction)? How does the lead time (i.e., the time interval from the prediction to the failure occurrence) impact prediction accuracy? To answer these questions, we analyze and compare period-based and event-driven prediction approaches via a Bayesian prediction model. We evaluate these prediction approaches, under a variety of testing parameters, by means of RAS logs collected from a production supercomputer at Argonne National Laboratory. Experimental results show that the period-based Bayesian model and the event-driven Bayesian model can achieve up to 65.0% and 83.8% prediction accuracy, respectively. Furthermore, our sensitivity study indicates that the event-driven approach seems more suitable for proactive fault management in large-scale systems like Blue Gene/P.
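
To make the terms in the abstract concrete, the following is a minimal sketch of when each approach would trigger a prediction and which events fall into its observation window. It is not the authors' implementation: the function names, the use of plain numeric timestamps (in seconds), and the exact interval boundaries are all assumptions made for illustration, and the Bayesian model itself is omitted.

```python
from bisect import bisect_left


def window_evidence(timestamps, end, window):
    """Events observed in the half-open interval [end - window, end).

    `timestamps` is assumed to be a sorted list of event times in seconds.
    """
    lo = bisect_left(timestamps, end - window)
    hi = bisect_left(timestamps, end)
    return timestamps[lo:hi]


def period_based_triggers(timestamps, start, stop, period, window):
    """Period-based: predict at fixed ticks, regardless of event arrivals."""
    predictions = []
    t = start
    while t <= stop:
        predictions.append((t, window_evidence(timestamps, t, window)))
        t += period
    return predictions


def event_driven_triggers(timestamps, window):
    """Event-driven: predict whenever a new event arrives.

    The evidence here covers [t - window, t] and includes the triggering
    event itself (an assumption of this sketch).
    """
    predictions = []
    for i, t in enumerate(timestamps):
        lo = bisect_left(timestamps, t - window)
        predictions.append((t, timestamps[lo:i + 1]))
    return predictions


def lead_time(prediction_time, failure_time):
    """Interval from the prediction to the failure occurrence."""
    return failure_time - prediction_time


if __name__ == "__main__":
    events = [10, 25, 40, 100]  # hypothetical event times, in seconds
    print(period_based_triggers(events, start=0, stop=120, period=60, window=30))
    print(event_driven_triggers(events, window=30))
    print(lead_time(prediction_time=40, failure_time=100))
```

In this sketch the period-based variant evaluates its window even when no new event has arrived, while the event-driven variant only evaluates when there is fresh evidence; each returned pair would then be passed to whatever prediction model is in use.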