Towards resilient high performance applications through real time reliability metric generation and autonomous failure correction

  • Authors:
  • Clayton F. Chandler;Chokchai Leangsuksun;Nathan DeBardeleben

  • Affiliations:
  • Louisiana Tech University, Ruston, LA, USA;Louisiana Tech University, Ruston, LA, USA;Los Alamos National Laboratory, Los Alamos, NM, USA

  • Venue:
  • Proceedings of the 2009 workshop on Resiliency in high performance
  • Year:
  • 2009

Abstract

One predominant barrier to research and development on resilient HPC applications is a substantial lack of reliability and performance data from extreme-scale computing distributions. To understand how and why highly scaled HPC applications encounter increasingly frequent performance interruptions, one must conduct extensive trending and analysis on contemporary machines and their associated programs. However, existing HPC application log files are labyrinthine documents that, even with the assistance of intelligent data mining algorithms, are difficult for humans to interpret. In addition, conventional log filtering is limited to post-mortem, reactive use, as the enormous size of these documents prevents efficient real time interaction. Thus, there is a strong need within the HPC field for accurate yet concise real time application information. Moreover, the means of reporting this data must be sufficiently lightweight and non-intrusive to attach successfully yet discreetly to the multiple processes running on multiple cores within tens (or in some cases, hundreds) of thousands of compute nodes. Furthermore, this information should in turn be used to facilitate the autonomous correction of application-threatening faults, suspensions, and interruptions. This paper describes a dynamic application instrumentation module (combining Open|SpeedShop software with custom scripting) aimed at achieving these goals.