Many high-performance scientific simulation codes use checkpointing for multiple reasons. Besides allowing a simulation to be completed across multiple job submissions, checkpointing has provided an adequate recovery mechanism up to the current generation of platforms. With the advent of million-way parallelism, application codes are looking for additional recovery options, which may or may not be transparent to the application. In many instances the application itself is best placed to judge whether a recovered solution is acceptable. In this paper, we explore one option for recovering from multiple faults in codes that use block-structured adaptive mesh refinement (AMR). AMR codes have easy access to a low-fidelity solution in the same physical space where they are also computing a higher-fidelity solution. When a fault occurs, this low-fidelity solution can be used to reconstruct the higher-fidelity solution in flight. We report our findings from one implementation of such a strategy in FLASH, a block-structured AMR community code for simulating reactive compressible flows. In all our experiments, the recovered solutions stayed within the error bounds of the applications considered.
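The recovery idea can be illustrated with a minimal one-dimensional sketch. It assumes a coarse parent block holds cell averages of the region covered by a refined block, and that a lost refined block is rebuilt by interpolating the parent data. The function names (`restrict`, `prolong`) and the unlimited linear interpolation are illustrative assumptions, not the operators used in FLASH, which the abstract does not specify.

```python
import math


def restrict(fine):
    """Average each pair of fine cells into one coarse cell (fine -> coarse)."""
    return [(fine[2 * i] + fine[2 * i + 1]) / 2.0 for i in range(len(fine) // 2)]


def prolong(coarse):
    """Rebuild two fine cells per coarse cell by linear interpolation.

    Hypothetical operator for illustration: the slope is a central
    difference in the interior and a one-sided difference at the block
    edges. The two fine values are placed symmetrically around the coarse
    value, so their average reproduces the coarse cell up to round-off.
    """
    n = len(coarse)
    fine = []
    for i, c in enumerate(coarse):
        if n == 1:
            slope = 0.0
        elif i == 0:
            slope = coarse[1] - coarse[0]
        elif i == n - 1:
            slope = coarse[-1] - coarse[-2]
        else:
            slope = (coarse[i + 1] - coarse[i - 1]) / 2.0
        # Fine cell centers sit a quarter of a coarse cell width
        # to either side of the coarse cell center.
        fine.append(c - slope / 4.0)
        fine.append(c + slope / 4.0)
    return fine


# A smooth "high-fidelity" solution on 16 fine cells covering [0, 1].
fine_solution = [math.sin((j + 0.5) / 16.0) for j in range(16)]

# The low-fidelity copy that the coarser level keeps for the same region.
coarse_solution = restrict(fine_solution)

# Simulate losing the fine block, then rebuild it from the coarse data.
recovered = prolong(coarse_solution)
max_error = max(abs(a - b) for a, b in zip(recovered, fine_solution))
print("max reconstruction error:", max_error)
```

For smooth data the reconstruction error shrinks with the square of the coarse cell width, so whether the recovered block is acceptable depends on the tolerance of the application, which is the judgement the abstract leaves to the application.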