Distributed Checkpointing Mechanism for a Parallel File System

  • Authors:
  • Vítor N. Távora; Luís Moura Silva; João Gabriel Silva

  • Venue:
  • Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2000

Abstract

Checkpointing techniques have been widely studied in the literature as a way to recover from failures in sequential, distributed, and parallel environments. However, most of the checkpointing mechanisms proposed so far focus only on recovering the application data. If the application performs I/O operations on disk files, such schemes may not work correctly, because they do not provide rollback-recovery for the file contents. In this paper, we present a distributed checkpointing mechanism for a parallel file system that can be integrated with any of the previously proposed application checkpointing algorithms. Three different file checkpointing schemes are presented, tested within that mechanism, and discussed in detail. The proposed distributed mechanism was integrated into PIOUS, a public-domain parallel file system developed for the PVM distributed computing environment.
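
To illustrate what rollback-recovery for file contents means, the following minimal sketch keeps a full shadow copy of a single local file at checkpoint time and restores it on rollback. It is a generic illustration only: the abstract does not name the three schemes studied in the paper, so this is not necessarily one of them, and it does not use PIOUS or PVM. The class name `ShadowFileCheckpoint`, the `.ckpt` suffix, and the `demo` function are hypothetical names chosen for this example.

```python
import os
import shutil
import tempfile


class ShadowFileCheckpoint:
    """Hypothetical helper: checkpoint one file by keeping a full shadow copy."""

    def __init__(self, path):
        self.path = path
        self.shadow = path + ".ckpt"  # assumed naming convention

    def checkpoint(self):
        # Copy to a temporary name first, then rename over the old shadow,
        # so a crash during the copy never destroys the previous checkpoint.
        tmp = self.shadow + ".tmp"
        shutil.copyfile(self.path, tmp)
        os.replace(tmp, self.shadow)

    def rollback(self):
        # Restore the file contents saved at the last checkpoint.
        tmp = self.path + ".restore"
        shutil.copyfile(self.shadow, tmp)
        os.replace(tmp, self.path)


def demo():
    """Write, checkpoint, write again, roll back, and return the file contents."""
    workdir = tempfile.mkdtemp()
    try:
        path = os.path.join(workdir, "data.txt")
        with open(path, "w") as f:
            f.write("state-1\n")

        ckpt = ShadowFileCheckpoint(path)
        ckpt.checkpoint()

        # This write happens after the checkpoint. Restoring only the
        # application's memory would leave it on disk after a rollback.
        with open(path, "a") as f:
            f.write("state-2\n")

        ckpt.rollback()
        with open(path) as f:
            return f.read()
    finally:
        shutil.rmtree(workdir)
```

Calling `demo()` returns the contents written before the checkpoint, because the later append is undone together with the rollback. A full copy is simple but costs time and space proportional to the file size. A distributed mechanism would additionally have to coordinate such file checkpoints across the file system's nodes and with the application checkpoint.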