Towards resilient high performance applications through real time reliability metric generation and autonomous failure correction

Authors:
Clayton F. Chandler;Chokchai Leangsuksun;Nathan DeBardeleben
Affiliations:
Louisiana Tech University, Ruston, LA, USA;Louisiana Tech University, Ruston, LA, USA;Los Alamos National Laboratory, Los Alamos, NM, USA
Venue:
Proceedings of the 2009 workshop on Resiliency in high performance
Year:
2009

Abstract

One predominant barrier to research and development aimed at resilient HPC applications is a substantial lack of reliability and performance data from extreme-scale computing distributions. To understand how and why highly scaled HPC applications encounter increasingly frequent performance interruptions, one must conduct extensive trending and analysis on contemporary machines and their associated programs. However, existing HPC application log files are labyrinthine documents that, even with the assistance of intelligent data mining algorithms, are difficult for humans to interpret. In addition, conventional log filtering is limited to post-mortem, reactive use, as the enormous size of these documents prevents efficient real-time interaction. Thus, there is a strong need within the HPC field for accurate yet concise real-time application information. Moreover, the means of reporting this data must be sufficiently lightweight and non-intrusive to attach discreetly to the multiple processes running on multiple cores within tens (or in some cases, hundreds) of thousands of compute nodes. Furthermore, this information should in turn be used to facilitate the autonomous correction of application-threatening faults, suspensions, and interruptions. This paper describes a dynamic application instrumentation module (utilizing a combination of Open|SpeedShop software and custom scripting) aimed at achieving these goals.