Methodologies for advance warning of compute cluster problems via statistical analysis: a case study

  • Authors:
  • Jim Brandt;Ann Gentile;Jackson Mayo;Philippe Pébay;Diana Roe;David Thompson;Matthew Wong

  • Affiliations:
  • Sandia National Laboratories, Livermore, CA, USA (all authors)

  • Venue:
  • Proceedings of the 2009 workshop on Resiliency in high performance
  • Year:
  • 2009

Abstract

The ability to predict impending failures (hardware or software) on large-scale high-performance compute (HPC) platforms, augmented by checkpoint mechanisms, could drastically increase the scalability of applications and the efficiency of platforms. In this paper we present our findings and the methodologies employed to date in our search for reliable, advance indicators of failures on a 288-node, 4608-core, Opteron-based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that, for a particular error case, statistical analysis using OVIS would provide advance warning of cluster problems on timescales that allow applications and system administrators to respond before errors occur, before the system error logs report them, and before jobs fail. This is significant because the utility of such indicators depends on how far in advance of failure they can be recognized and how reliable they are.