Many scientific problems can be distributed across a large number of processors to take advantage of low-cost workstations. In a parallel system, a failure on any processor can halt the computation and require restarting all applications. Checkpointing is a simple technique for recovering a failed execution. The Message Passing Interface (MPI) is a standard proposed for writing portable message-passing parallel programs. In this paper, we present a checkpointing implementation for MPI programs that is transparent and requires no changes to the application programs. Our implementation combines coordinated checkpointing, uncoordinated checkpointing, and message logging techniques.
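As a rough illustration of how a checkpoint and a message log can work together during recovery, the following toy sketch simulates a single process in plain Python. It is not the implementation described in this paper and uses no MPI calls; the `Process` class, its methods, and the running-sum state are all hypothetical names chosen for the example, and the log is kept in memory where a real system would write it to stable storage.

```python
class Process:
    """Toy process whose application state is a running sum of received messages."""

    def __init__(self):
        self.total = 0        # application state (hypothetical)
        self.checkpoint = 0   # last saved copy of the state
        self.log = []         # messages received since the last checkpoint

    def receive(self, msg):
        # Log the message before delivering it, so it can be replayed later.
        self.log.append(msg)
        self.total += msg

    def take_checkpoint(self):
        # Save the state; messages already reflected in it need not be kept.
        self.checkpoint = self.total
        self.log.clear()

    def recover(self):
        # Roll back to the checkpoint, then replay the logged messages in order.
        self.total = self.checkpoint
        for msg in self.log:
            self.total += msg


p = Process()
for m in [1, 2, 3]:
    p.receive(m)
p.take_checkpoint()           # checkpointed state covers messages 1, 2, 3
for m in [4, 5]:
    p.receive(m)              # these two messages exist only in the log
state_before_failure = p.total

p.total = None                # simulate a crash that loses in-memory state
p.recover()
assert p.total == state_before_failure
```

In this sketch the process checkpoints on its own schedule, which corresponds loosely to the uncoordinated case; the log is what lets it reach its pre-failure state without forcing other processes to roll back.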