A case for virtual machine based fault injection in a high-performance computing environment

Authors:
Thomas Naughton;Geoffroy Vallée;Christian Engelmann;Stephen L. Scott
Affiliations:
Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN;Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN;Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN;Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN
Venue:
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Year:
2011

Citing 10
Cited 0

Fault Injection Techniques and Tools
Computer

Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers
IEEE Transactions on Software Engineering

Experimental assessment of parallel systems
FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)

A Framework for Assessing Dependability in Distributed Systems with Lightweight Fault Injectors
IPDS '00 Proceedings of the 4th International Computer Performance and Dependability Symposium

System management software for virtual environments
Proceedings of the 4th international conference on Computing frontiers

FAIL-FCI: Versatile fault injection
Future Generation Computer Systems

Evaluating fault-tolerant system designs using FAUmachine
Proceedings of the 2007 workshop on Engineering fault tolerant systems

The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software
HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing

A Scalable Tools Communications Infrastructure
HPCS '08 Proceedings of the 2008 22nd International Symposium on High Performance Computing Systems and Applications

CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing

Abstract

Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflops and beyond to exascale platforms, failures will become much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This research includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques. While there is much work to be done on the FT/Resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chain for these systems is often tailored to the platform, and the operating environments typically contain many site- and machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize interruption to the end user (the scientist). The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools to investigate the effects of failures in HPC and the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.