Fault injection framework for system resilience evaluation: fake faults for finding future failures

  • Authors:
  • Thomas Naughton; Wesley Bland; Geoffroy Vallee; Christian Engelmann; Stephen L. Scott

  • Affiliations:
  • Oak Ridge National Laboratory, Oak Ridge, TN, USA (all authors)

  • Venue:
  • Proceedings of the 2009 Workshop on Resiliency in High Performance Computing
  • Year:
  • 2009

Abstract

As high-performance computing (HPC) systems increase in size and complexity, they become more difficult to manage. The enormous component counts of these large systems lead to significant challenges in system reliability and availability. This in turn is driving research into the resilience of large-scale systems, which seeks to curb the effects of increased failure rates at large scales by masking the inevitable faults in these systems. The basic premise is that failure must be accepted as a reality of large-scale systems and coped with accordingly through system resilience. A key component in the development and evaluation of system resilience techniques is a means to conduct controlled experiments. A common method for performing such experiments is to generate synthetic faults and study the resulting effects. In this paper we discuss the motivation for and our initial use of software fault injection to support the evaluation of resilience for HPC systems. We review background and related work in the area and discuss the design of a tool to aid in fault injection experiments for both user-space (application-level) and system-level failures.
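
To make the idea of generating synthetic faults concrete, the sketch below shows one simple form of user-space fault injection, assuming a POSIX environment: a launcher forks a target command and sends it SIGKILL after a random delay, emulating an abrupt process failure so that a resilience mechanism (e.g., checkpoint/restart or a fault-tolerant runtime) can be observed under a controlled fault. This is an illustrative sketch only, not the authors' tool; the command-line interface and the delay bound are hypothetical choices for this example.

```c
/* Minimal user-space fault injection sketch (illustrative, not the paper's
 * framework): run a target command, then kill it after a random delay to
 * emulate an abrupt process failure. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
        return 1;
    }

    srand((unsigned)time(NULL));
    pid_t child = fork();

    if (child == 0) {
        /* Child: exec the application under test. */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }

    /* Parent: wait a random 0-9 seconds, then inject the "fault". */
    sleep((unsigned)(rand() % 10));
    kill(child, SIGKILL);

    int status;
    waitpid(child, &status, 0);
    printf("target terminated, status=0x%x\n", status);
    return 0;
}
```

A launcher of this kind only covers abrupt process-kill faults at the application level; system-level failures (e.g., node crashes or degraded components) require injection mechanisms below the application, which is part of what motivates a dedicated framework.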