An analysis of clustered failures on large supercomputing systems

  • Authors:
  • Thomas J. Hacker, Fabian Romero, Christopher D. Carothers

  • Affiliations:
  • Thomas J. Hacker: Computer & Information Technology, Purdue University, 401 North Grant Street, West Lafayette, IN 47907, USA; Discovery Park Cyber Center, Purdue University, West Lafayette, IN 47907, USA
  • Fabian Romero: Computer & Information Technology, Purdue University, 401 North Grant Street, West Lafayette, IN 47907, USA
  • Christopher D. Carothers: Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2009

Abstract

Large supercomputers are built today from thousands of commodity components and suffer from poor reliability due to frequent component failures. The failure characteristics observed on large-scale systems differ from those of the smaller-scale systems studied in the past. One striking difference is that system events are clustered temporally and spatially, which complicates failure analysis and application design. Developing a clear understanding of failures in large-scale systems is a critical step toward building more reliable systems and applications that can better tolerate and recover from failures. In this paper, we analyze the event logs of two large IBM Blue Gene systems, statistically characterize system failures, present a model for predicting the probability of node failure, and assess the effects of differing failure rates on job failures for large-scale systems. The work presented in this paper will be useful for developers and designers seeking to deploy efficient and reliable petascale systems.
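
To make the modeling idea concrete, here is a minimal, hypothetical sketch of one common approach to this kind of analysis: fitting a Weibull distribution to the inter-arrival times of failures extracted from an event log, then using the fitted distribution to estimate the probability of a failure within a given window. This is an illustrative assumption, not the paper's actual model; the timestamps and variable names below are invented.

```python
# Illustrative sketch only -- NOT the authors' model. Assumes failure
# timestamps (hours since log start) have already been extracted from
# the event log; the values here are hypothetical.

import numpy as np
from scipy.stats import weibull_min

failure_times = np.array([12.0, 40.5, 41.0, 41.2, 95.3, 96.1, 170.0])

# Inter-arrival times between consecutive failures; temporal clustering
# shows up as runs of short gaps mixed with long quiet periods.
gaps = np.diff(np.sort(failure_times))

# Fit a two-parameter Weibull (location fixed at 0). A shape parameter
# c < 1 indicates a decreasing hazard rate, i.e. failures tend to
# arrive in bursts, consistent with temporal clustering.
c, loc, scale = weibull_min.fit(gaps, floc=0)

# Probability that the next failure occurs within `window` hours.
window = 24.0
p_fail = weibull_min.cdf(window, c, loc=loc, scale=scale)

print(f"shape={c:.3f}, scale={scale:.3f} h")
print(f"P(failure within {window:.0f} h) = {p_fail:.3f}")
```

A fitted distribution of this kind can feed directly into the sort of question the abstract raises: given a job spanning many nodes for many hours, how likely is it to encounter at least one node failure before completion.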