Methodologies for advance warning of compute cluster problems via statistical analysis: a case study

Authors:
Jim Brandt;Ann Gentile;Jackson Mayo;Philippe Pébay;Diana Roe;David Thompson;Matthew Wong
Affiliations:
Sandia National Laboratories, Livermore, CA, USA;Sandia National Laboratories, Livermore, CA, USA;Sandia National Laboratories, Livermore, CA, USA;Sandia National Laboratories, Livermore, CA, USA;Sandia National Laboratories, Livermore, CA, USA;Sandia National Laboratories, Livermore, CA, USA;Sandia National Laboratories, Livermore, CA, USA
Venue:
Proceedings of the 2009 workshop on Resiliency in high performance
Year:
2009

Abstract

The ability to predict impending failures (hardware or software) on large-scale high-performance computing (HPC) platforms, augmented by checkpoint mechanisms, could drastically increase the scalability of applications and the efficiency of platforms. In this paper we present the findings and methodologies employed to date in our search for reliable, advance indicators of failures on a 288-node, 4608-core, Opteron-based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that, for a particular error case, statistical analysis using OVIS would provide advance warning of cluster problems on timescales that would allow applications and system administrators to respond before errors, the subsequent system error log reports, and job failures occur. This is significant because the utility of such indicators depends on how far in advance of failure they can be recognized and how reliable they are.