Using unreliable virtual hardware to inject errors in extreme-scale systems

  • Authors:
  • Scott Levy; Matthew G.F. Dosanjh; Patrick G. Bridges; Kurt B. Ferreira

  • Affiliations:
  • University of New Mexico, Albuquerque, NM, USA; University of New Mexico, Albuquerque, NM, USA; University of New Mexico, Albuquerque, NM, USA; Sandia National Laboratories, Albuquerque, NM, USA

  • Venue:
  • Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
  • Year:
  • 2013

Abstract

Fault tolerance is a key obstacle to next-generation extreme-scale systems. As systems scale, the Mean Time To Interrupt (MTTI) decreases proportionally. As a result, extreme-scale systems are likely to experience higher rates of failure in the future. To mitigate this, significant research has focused on developing and validating fault tolerance techniques. However, evaluating techniques for withstanding hardware failures at large scale is challenging because replicating those failures on small-scale testbeds is difficult. In this paper, we propose a virtualization-based framework for creating testbeds with unreliable virtual hardware. Our proposed approach allows for comprehensive evaluation of fault tolerance techniques in a broad range of failure regimes. Although there are many other approaches for mimicking unreliable hardware, none of them offer the breadth, scalability, and performance that a virtualization-based solution does.