Fault tolerance using lower fidelity data in adaptive mesh applications

Authors:
Anshu Dubey;Prateeti Mohapatra;Klaus Weide
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, USA;University of Chicago, Chicago, USA;University of Chicago, Chicago, USA
Venue:
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Year:
2013

Abstract

Many high-performance scientific simulation codes use checkpointing for more than one purpose. Besides allowing a simulation to be completed over multiple job submissions, checkpointing has provided an adequate recovery mechanism up to the current generation of platforms. With the advent of million-way parallelism, application codes are looking for additional recovery options, which may or may not be transparent to the application. In many instances the application is best placed to judge whether a recovered solution is acceptable. In this paper, we explore one option for recovering from multiple faults in codes that use block-structured adaptive mesh refinement (AMR). AMR codes have easy access to a low-fidelity solution in the same physical space where they are also computing a higher-fidelity solution. When a fault occurs, this low-fidelity solution can be used to reconstruct the higher-fidelity solution in flight. We report our findings from one implementation of this strategy in FLASH, a block-structured AMR community code for simulating reactive compressible flows. In all our experiments, the recovered solution stayed within the error bounds of the applications considered.
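
To make the recovery idea concrete, the following is a minimal one-dimensional sketch, not the FLASH implementation. It assumes a refinement ratio of 2, cell-averaged data, and a coarse copy of the solution that survives the fault. The function names `restrict`, `prolong`, and `recover_block` are hypothetical and chosen only for illustration. The interpolation used here (a conservative linear reconstruction) is likewise an assumption; the abstract does not state which interpolation the authors use.

```python
def restrict(fine):
    """Average each pair of fine cells into one coarse cell (ratio 2)."""
    return [(fine[2 * i] + fine[2 * i + 1]) / 2.0 for i in range(len(fine) // 2)]


def prolong(coarse):
    """Rebuild fine cells from coarse cells by conservative linear interpolation.

    Each coarse cell yields two fine cells whose mean equals the coarse value.
    At the two ends, the cell's own value stands in for the missing neighbour.
    """
    n = len(coarse)
    fine = []
    for i, c in enumerate(coarse):
        left = coarse[i - 1] if i > 0 else c
        right = coarse[i + 1] if i < n - 1 else c
        slope = (right - left) / 2.0  # change per coarse cell width
        fine.append(c - slope / 4.0)  # fine cell centred a quarter width to the left
        fine.append(c + slope / 4.0)  # fine cell centred a quarter width to the right
    return fine


def recover_block(surviving_coarse):
    """Hypothetical recovery step: the fine block was lost, so rebuild it
    from the lower-fidelity coarse data covering the same physical region."""
    return prolong(surviving_coarse)


# Fine data before the fault: cell averages of f(x) = x on unit-width cells.
fine_before = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
coarse_copy = restrict(fine_before)        # low-fidelity data kept on the coarse level
fine_after = recover_block(coarse_copy)    # reconstruction after the fine data is lost
max_error = max(abs(a - b) for a, b in zip(fine_before, fine_after))
```

In this sketch the reconstruction preserves the coarse cell averages, and the application can compare a quantity such as `max_error` (or any criterion of its own) against its error tolerance to decide whether the recovered solution is acceptable.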