BlueGene/L Failure Analysis and Prediction Models

  • Authors:
  • Yinglung Liang; Yanyong Zhang; Anand Sivasubramaniam; Morris Jette; Ramendra Sahoo

  • Affiliations:
  • Rutgers University; Rutgers University; Penn State University; Lawrence Livermore National Laboratory; IBM T. J. Watson Research Center

  • Venue:
  • DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
  • Year:
  • 2006

Abstract

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. One of the challenges in designing and deploying these systems in a production setting is the need to take failure occurrences, whether in hardware or in software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective for these emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on these observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.