Fault Injection for Dependability Validation: A Methodology and Some Applications
IEEE Transactions on Software Engineering
Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers
IEEE Transactions on Software Engineering
Framework for Testing the Fault-Tolerance of Systems Including OS and Network Aspects
HASE '01 The 6th IEEE International Symposium on High-Assurance Systems Engineering: Special Topic: Impact of Networking
Assessing Fault Sensitivity in MPI Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Evaluating fault-tolerant system designs using FAUmachine
Proceedings of the 2007 workshop on Engineering fault tolerant systems
Soft error vulnerability of iterative linear algebra methods
Proceedings of the 22nd annual international conference on Supercomputing
Towards a hardware fault-injection testbed to support reproducible resiliency experiments
Proceedings of the 2009 workshop on Resiliency in high performance
DRAM errors in the wild: a large-scale field study
Communications of the ACM
PRDC '10 Proceedings of the 2010 IEEE 16th Pacific Rim International Symposium on Dependable Computing
Using the TOP500 to trace and project technology and architecture trends
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Virtual-machine-based emulation of future generation high-performance computing systems
International Journal of High Performance Computing Applications
Cooperative Application/OS DRAM fault recovery
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Evaluating operating system vulnerability to memory errors
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Fault tolerance is a key obstacle to next generation extreme-scale systems. As systems scale, the Mean Time To Interrupt (MTTI) decreases proportionally. As a result, extreme-scale systems are likely to experience higher rates of failure in the future. To mitigate this, significant research has focused on developing and validating fault tolerance techniques. However, evaluating techniques for withstanding hardware failures at large scale is challenging because replicating those failures on small-scale testbeds is difficult. In this paper, we propose a virtualization-based framework for creating testbeds with unreliable virtual hardware. Our proposed approach allows for comprehensive evaluation of fault tolerance techniques in a broad range of failure regimes. Although there are many other approaches for mimicking unreliable hardware, none of them offer the breadth, scalability, and performance that a virtualization-based solution does.