Evaluating operating system vulnerability to memory errors

  • Authors:
  • Kurt B. Ferreira;Kevin Pedretti;Ron Brightwell;Patrick G. Bridges;David Fiala;Frank Mueller

  • Affiliations:
  • Sandia National Laboratories;Sandia National Laboratories;Sandia National Laboratories;University of New Mexico;North Carolina State University;North Carolina State University

  • Venue:
  • Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
  • Year:
  • 2012

Abstract

Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically occupies only a small portion of a compute node's physical memory, recent studies show more memory errors in this region of memory than in the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show that the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.