Using unreliable virtual hardware to inject errors in extreme-scale systems

Authors:
Scott Levy;Matthew G.F. Dosanjh;Patrick G. Bridges;Kurt B. Ferreira
Affiliations:
University of New Mexico, Albuquerque, NM, USA;University of New Mexico, Albuquerque, NM, USA;University of New Mexico, Albuquerque, NM, USA;Sandia National Laboratories, Albuquerque, NM, USA
Venue:
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Year:
2013

Abstract

Fault tolerance is a key obstacle to next-generation extreme-scale systems. As systems scale, the Mean Time To Interrupt (MTTI) decreases proportionally. As a result, extreme-scale systems are likely to experience higher rates of failure in the future. To mitigate this, significant research has focused on developing and validating fault tolerance techniques. However, evaluating techniques for withstanding hardware failures at large scale is challenging because replicating those failures on small-scale testbeds is difficult. In this paper, we propose a virtualization-based framework for creating testbeds with unreliable virtual hardware. Our proposed approach allows for comprehensive evaluation of fault tolerance techniques in a broad range of failure regimes. Although there are many other approaches for mimicking unreliable hardware, none of them offers the breadth, scalability, and performance that a virtualization-based solution does.
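
The scaling claim in the abstract can be illustrated with a small calculation. This is a minimal sketch, not the authors' model: it assumes node failures are independent and exponentially distributed, in which case the system-level MTTI is the per-node mean time between failures divided by the node count. The function name `system_mtti_hours` and the 5-year per-node figure are hypothetical, chosen only for illustration.

```python
def system_mtti_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Estimate system MTTI from per-node MTBF.

    Assumption (not from the paper): node failures are independent and
    exponentially distributed, so the first failure among N nodes arrives
    at N times the per-node rate.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if node_mtbf_hours <= 0:
        raise ValueError("node_mtbf_hours must be positive")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    # Hypothetical per-node MTBF of 5 years, expressed in hours.
    node_mtbf = 5 * 365 * 24
    for nodes in (1_000, 10_000, 100_000):
        mtti = system_mtti_hours(node_mtbf, nodes)
        print(f"{nodes:>7} nodes: MTTI = {mtti:.3f} h")
```

Under these assumptions, a tenfold increase in node count cuts the MTTI by a factor of ten, which is why failure regimes expected at extreme scale are hard to reproduce on a small testbed without some form of injection.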