Checkpointing Message-Passing Interface (MPI) Parallel Programs

  • Authors:
  • Wei-Jih Li;Jyh-Jong Tsay

  • Venue:
  • PRFTS '97 Proceedings of the 1997 Pacific Rim International Symposium on Fault-Tolerant Systems
  • Year:
  • 1997

Abstract

Many scientific problems can be distributed across a large number of processors to take advantage of low-cost workstations. In a parallel system, a failure on any processor can halt the computation and require restarting all applications. Checkpointing is a simple technique for recovering a failed execution. The Message Passing Interface (MPI) is a standard for writing portable message-passing parallel programs. In this paper, we present a checkpointing implementation for MPI programs that is transparent and requires no changes to the application programs. Our implementation combines coordinated checkpointing, uncoordinated checkpointing, and message logging techniques.
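
To illustrate the general idea of checkpoint and restart mentioned in the abstract, here is a minimal single-process sketch in Python. It is not the authors' implementation: the paper describes a transparent, MPI-level mechanism that needs no application changes, whereas this sketch is application-level, uses no MPI, and does not attempt coordination or message logging. The names `save_checkpoint`, `load_checkpoint`, `run`, and `demo`, the pickle file format, and the `fail_at` failure injection are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename (os.replace) is atomic, so a crash mid-write cannot leave a
    half-written checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical) injects a simulated crash when execution
    reaches that step, standing in for a processor failure.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated processor failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo():
    """Crash partway through, then restart and resume from the checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        try:
            run(path, n_steps=10, interval=5, fail_at=7)
        except RuntimeError:
            pass  # the "failed execution"; work after the last checkpoint is lost
        resumed_from = load_checkpoint(path)["step"]
        total = run(path, n_steps=10, interval=5)  # restart from the checkpoint
        return resumed_from, total
```

In this sketch the second call to `run` resumes from the last saved step instead of step 0, so only the work done after the most recent checkpoint is repeated. A real parallel checkpointing system must additionally ensure that the states saved by different processes, together with any messages in transit, form a consistent global state, which is the problem that coordinated checkpointing and message logging address.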