Fault tolerance using lower fidelity data in adaptive mesh applications

  • Authors:
  • Anshu Dubey; Prateeti Mohapatra; Klaus Weide

  • Affiliations:
  • Lawrence Berkeley National Laboratory, Berkeley, USA; University of Chicago, Chicago, USA; University of Chicago, Chicago, USA

  • Venue:
  • Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
  • Year:
  • 2013

Abstract

Many high-performance scientific simulation codes use checkpointing for multiple reasons. Besides giving users the flexibility to complete a simulation over multiple job submissions, checkpointing has provided an adequate recovery mechanism up to the current generation of platforms. With the advent of million-way parallelism, application codes are looking for additional recovery options, which may or may not be transparent to the applications. In many instances, the application itself can best judge whether a recovered solution is acceptable. In this paper, we explore one option for recovering from multiple faults in codes that use block-structured adaptive mesh refinement (AMR). AMR codes have easy access to a low-fidelity solution in the same physical space where they are also computing a higher-fidelity solution. When a fault occurs, this low-fidelity solution can be used to reconstruct the higher-fidelity solution in flight. We report our findings from one implementation of such a strategy in FLASH, a block-structured AMR community code for the simulation of reactive compressible flows. In all our experiments, the recovered solution remained within the error bounds of the considered applications.
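
The recovery idea in the abstract can be illustrated with a minimal 1D sketch. It is not the FLASH implementation: the function names `restrict` and `prolong`, the refinement ratio of 2, and the choice of linear interpolation are all assumptions made for illustration. A coarse parent block holds cell averages of its fine child block; if the child's data is lost, it is rebuilt from the parent by interpolation.

```python
def restrict(fine):
    """Average each pair of fine cells into one coarse cell (ratio 2, assumed)."""
    return [(fine[2 * i] + fine[2 * i + 1]) / 2.0 for i in range(len(fine) // 2)]


def prolong(coarse):
    """Rebuild fine cells from coarse cells by linear interpolation.

    Hypothetical stand-in for the reconstruction step: each coarse cell is
    split into two children offset by +/- slope/4, so the children's
    average equals the coarse value.
    """
    n = len(coarse)
    fine = []
    for i, c in enumerate(coarse):
        if n == 1:
            slope = 0.0                                   # no neighbors: copy the value
        elif i == 0:
            slope = coarse[1] - coarse[0]                 # one-sided difference, left edge
        elif i == n - 1:
            slope = coarse[-1] - coarse[-2]               # one-sided difference, right edge
        else:
            slope = (coarse[i + 1] - coarse[i - 1]) / 2.0  # central difference
        fine.extend([c - slope / 4.0, c + slope / 4.0])
    return fine


# Simulated fault: the fine block is lost and only the coarse parent survives.
fine_block = [0.5 * k for k in range(8)]
coarse_parent = restrict(fine_block)
recovered = prolong(coarse_parent)
```

For data that varies linearly across the block, as in this usage example, the recovered block matches the lost one exactly; for general data the reconstruction carries an interpolation error, and the application has to judge whether that error is acceptable.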