Evaluating the feasibility of using memory content similarity to improve system resilience

  • Authors:
  • Scott Levy; Patrick G. Bridges; Kurt B. Ferreira; Aidan P. Thompson; Christian Trott

  • Affiliations:
  • University of New Mexico; University of New Mexico; Sandia National Laboratories; Sandia National Laboratories; Sandia National Laboratories

  • Venue:
  • Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
  • Year:
  • 2013

Abstract

Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. Memory errors are one of the most common sources of failure in large-scale systems. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the feasibility of this approach by examining memory snapshots collected from eight HPC applications. Based on the characteristics of the similarity that we uncover in these applications, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
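
To make the idea of memory content similarity concrete, the following is a minimal sketch of one way a memory snapshot could be scanned for pages with identical content. It is not the authors' runtime or their measurement method, which the abstract does not describe. The function name `duplicate_page_fraction`, the 4 KiB page size, and the restriction to exact (byte-for-byte) duplicates are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical; not stated in the abstract)


def duplicate_page_fraction(snapshot: bytes, page_size: int = PAGE_SIZE) -> float:
    """Return the fraction of full pages whose content appears in at least one other page.

    Hypothetical helper: only exact page-level duplicates are counted,
    and any trailing partial page is ignored.
    """
    # Split the snapshot into full, non-overlapping pages.
    pages = [
        snapshot[offset:offset + page_size]
        for offset in range(0, len(snapshot) - page_size + 1, page_size)
    ]
    if not pages:
        return 0.0

    # Hash each page so identical contents map to the same key.
    counts = Counter(hashlib.sha256(page).digest() for page in pages)

    # A page is "duplicated" if at least one other page has the same hash.
    duplicated = sum(count for count in counts.values() if count > 1)
    return duplicated / len(pages)


if __name__ == "__main__":
    # Synthetic snapshot: two identical zero pages plus two distinct pages.
    demo = (
        b"\x00" * PAGE_SIZE * 2
        + b"\x01" * PAGE_SIZE
        + bytes(range(256)) * (PAGE_SIZE // 256)
    )
    print(duplicate_page_fraction(demo))
```

In this sketch, a page with at least one identical copy elsewhere in the snapshot is the kind of page whose contents could, in principle, be recovered from that copy after a memory error. Similar but non-identical pages are outside the scope of the sketch.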