Checkpointing Message-Passing Interface (MPI) Parallel Programs

Authors:
Wei-Jih Li; Jyh-Jong Tsay
Venue:
PRFTS '97 Proceedings of the 1997 Pacific Rim International Symposium on Fault-Tolerant Systems
Year:
1997

Abstract

Many scientific problems can be distributed across a large number of processors to take advantage of low-cost workstations. In a parallel system, a failure on any processor can halt the computation and require restarting all applications. Checkpointing is a simple technique for recovering a failed execution. The Message Passing Interface (MPI) is a standard proposed for writing portable message-passing parallel programs. In this paper, we present a checkpointing implementation for MPI programs that is transparent and requires no changes to the application programs. Our implementation combines coordinated checkpointing, uncoordinated checkpointing, and message logging techniques.
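
The abstract does not describe the implementation itself, so the following is only a minimal sketch of the general idea behind checkpointing combined with message logging, not the authors' method. It assumes a toy process whose state is a running sum of the messages it receives. The names `Process`, `receive`, `take_checkpoint`, `recover`, and `demo` are hypothetical, and no real MPI calls are used.

```python
class Process:
    """Toy process (hypothetical): state is the sum of received messages."""

    def __init__(self):
        self.total = 0        # application state
        self.checkpoint = 0   # last saved copy of the state
        self.log = []         # messages received since the last checkpoint
        # Assumption: a real system would keep the checkpoint and the log
        # on stable storage; here they simply live in separate fields.

    def receive(self, msg):
        self.log.append(msg)  # log first, so the message can be replayed
        self.total += msg

    def take_checkpoint(self):
        self.checkpoint = self.total
        self.log.clear()      # older messages are covered by the checkpoint

    def recover(self):
        self.total = self.checkpoint
        for msg in self.log:  # replay messages received after the checkpoint
            self.total += msg


def demo():
    p = Process()
    for m in (1, 2, 3):
        p.receive(m)
    p.take_checkpoint()
    for m in (4, 5):
        p.receive(m)
    before = p.total
    p.total = None            # simulate a failure that loses the live state
    p.recover()
    return before, p.total
```

Calling `demo()` returns the state before the simulated failure and the state after recovery. The two values match because the logged messages are replayed on top of the restored checkpoint.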