A practical failure prediction with location and lead time for Blue Gene/P

  • Authors: Ziming Zheng; Zhiling Lan; Rinku Gupta; Susan Coghlan; Peter Beckman

  • Affiliations: Department of Computer Science, Illinois Institute of Technology; Department of Computer Science, Illinois Institute of Technology; Mathematics and Computer Science Division, Argonne National Laboratory; Argonne Leadership Computing Facility, Argonne National Laboratory; Mathematics and Computer Science Division, Argonne National Laboratory

  • Venue: DSNW '10 Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)
  • Year: 2010

Abstract

Analyzing, understanding, and predicting failures is of paramount importance for effective fault management. While various fault prediction methods have been studied in the past, many of them are not practical for use in real systems. In particular, they fail to address two crucial issues: providing location information (i.e., the components on which the failure is expected to occur) and providing sufficient lead time (i.e., the time interval preceding the failure occurrence). In this paper, we first refine the widely used metrics for evaluating prediction accuracy by including location as well as lead time. We then present a practical failure prediction mechanism for IBM Blue Gene systems. The mechanism uses a genetic-algorithm-based method that takes the location and the lead time into consideration for failure prediction. We demonstrate its effectiveness using real failure logs and job logs collected from the IBM Blue Gene/P system at Argonne National Laboratory. Our experiments show that, by incorporating location and lead time information in the prediction, the presented method can significantly improve fault management (e.g., reducing service unit loss by up to 52.4%).
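
The abstract does not spell out the refined metrics, so the following is only an illustrative sketch of one plausible reading: a prediction counts as correct only if it names the component that actually fails and is issued at least a minimum lead time before the failure. The class names, the `min_lead` parameter, the prediction window, and the component IDs are all assumptions made for illustration, not the definitions used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    issued_at: float   # time the warning was raised, in seconds
    location: str      # predicted component (hypothetical ID format)
    window_end: float  # latest time the predicted failure may occur


@dataclass(frozen=True)
class Failure:
    time: float        # time the failure actually occurred, in seconds
    location: str      # component on which it occurred


def is_hit(p: Prediction, f: Failure, min_lead: float) -> bool:
    """A prediction matches a failure only if the location agrees, the
    warning came at least `min_lead` seconds early, and the failure
    falls inside the prediction window (assumed criteria)."""
    return (
        p.location == f.location
        and f.time - p.issued_at >= min_lead
        and f.time <= p.window_end
    )


def precision_recall(predictions, failures, min_lead):
    """Precision: fraction of predictions that match some failure.
    Recall: fraction of failures matched by some prediction."""
    good = sum(any(is_hit(p, f, min_lead) for f in failures) for p in predictions)
    covered = sum(any(is_hit(p, f, min_lead) for p in predictions) for f in failures)
    precision = good / len(predictions) if predictions else 0.0
    recall = covered / len(failures) if failures else 0.0
    return precision, recall


# Made-up example data, not taken from the Blue Gene/P logs.
failures = [Failure(300.0, "R00-M0"), Failure(300.0, "R01-M0")]
predictions = [
    Prediction(0.0, "R00-M0", 600.0),    # right place, 300 s of lead time
    Prediction(0.0, "R00-M1", 600.0),    # wrong component
    Prediction(250.0, "R01-M0", 900.0),  # right place, only 50 s of lead time
]

strict = precision_recall(predictions, failures, min_lead=120.0)
lenient = precision_recall(predictions, failures, min_lead=0.0)
```

Under these assumptions, requiring 120 seconds of lead time rejects the late warning for `R01-M0`, so both precision and recall drop relative to the lenient case. This illustrates why accuracy figures that ignore location and lead time can overstate how useful a predictor is for fault management.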