As the high performance computing (HPC) community continues to push for ever larger machines, reliability remains a serious obstacle. Further, as feature sizes and voltages decrease, the rate of transient soft errors is rising. Today's HPC programmers already have to deal with these faults to a small degree, and the problem is expected to grow as systems continue to scale. In this paper we present SEFI, the Soft Error Fault Injection framework, a tool for profiling software for its susceptibility to soft errors. In particular, we focus on logic soft error injection. Using QEMU, an open source virtual machine and processor emulator, we demonstrate how emulated machine instructions can be modified to introduce soft errors. We conduct experiments by modifying the virtual machine itself, in a way that does not require intimate knowledge of the tested application. With this technique, we show that we are able to inject simulated soft errors into the logic operations of a target application without affecting other applications or the operating system sharing the VM. We present some initial results and discuss where we think this work will be useful in next generation hardware/software co-design.
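To make the idea of a logic soft error concrete, the following is a minimal sketch of a single-bit flip applied to the result of an emulated floating-point multiply. It is not SEFI's implementation and does not use any QEMU API; the names `flip_bit` and `faulty_multiply`, and the choice of a uniformly random bit position, are assumptions made purely for illustration.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 double encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit index must be in [0, 64)")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the pattern as a double again.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def faulty_multiply(a: float, b: float, inject: bool, rng: random.Random) -> float:
    """Hypothetical stand-in for an emulated multiply instruction.

    When `inject` is true, one randomly chosen bit of the correct result
    is flipped before it is returned (an assumed fault model).
    """
    result = a * b
    if inject:
        result = flip_bit(result, rng.randrange(64))
    return result


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be repeated
    print("clean :", faulty_multiply(2.0, 3.0, inject=False, rng=rng))
    print("faulty:", faulty_multiply(2.0, 3.0, inject=True, rng=rng))
```

The impact of a flip depends heavily on which bit is hit: a flip in the low mantissa bits perturbs the result only slightly, while a flip in the exponent or sign bit changes it drastically. This is one reason profiling an application's susceptibility calls for many injection runs rather than a single one.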