Cooperative Application/OS DRAM fault recovery

  • Authors:
  • Patrick G. Bridges; Mark Hoemmen; Kurt B. Ferreira; Michael A. Heroux; Philip Soltero; Ron Brightwell

  • Affiliations:
  • Department of Computer Science, University of New Mexico, Albuquerque, NM; Sandia National Laboratories, Albuquerque, NM; Department of Computer Science, University of New Mexico, Albuquerque, NM, USA and Sandia National Laboratories, Albuquerque, NM; Sandia National Laboratories, Albuquerque, NM; Department of Computer Science, University of New Mexico, Albuquerque, NM; Sandia National Laboratories, Albuquerque, NM

  • Venue:
  • Euro-Par'11: Proceedings of the 2011 International Conference on Parallel Processing - Volume 2
  • Year:
  • 2011

Abstract

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application / OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.