Distributed Checkpointing Mechanism for a Parallel File System

Authors:
Vítor N. Távora;Luís Moura Silva;João Gabriel Silva
Venue:
Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Year:
2000

Abstract

Checkpointing techniques have been widely studied in the literature as a way to recover from failures in sequential, distributed and parallel environments. However, most of the checkpointing mechanisms proposed so far focus only on recovering the application data. If the application performs I/O operations on disk files, such schemes may not work correctly, because they do not provide rollback-recovery for the file contents: after a rollback, the restored application state can be inconsistent with files that still reflect writes made after the checkpoint. In this paper, we present a distributed checkpointing mechanism for a parallel file system that can be integrated with any of the previous application checkpointing algorithms. Three different file checkpointing schemes are presented, tested within that mechanism and discussed in detail. The proposed distributed mechanism was integrated into PIOUS, a public-domain parallel file system developed for the PVM distributed computing environment.
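
The problem described above can be illustrated with a minimal single-process sketch. It is not one of the three schemes from the paper (the abstract does not describe them) and it does not use PIOUS or PVM. It assumes a simple full-copy ("shadow copy") file checkpoint, and the names `ShadowCopyCheckpoint`, `append_record` and `run` are hypothetical. The sketch shows that rolling back only the application state leaves a duplicate record in the file, while also rolling back the file contents does not.

```python
import os
import shutil
import tempfile


class ShadowCopyCheckpoint:
    """Hypothetical helper: keeps a full copy of a file taken at checkpoint time."""

    def __init__(self, path):
        self.path = path
        self.shadow = path + ".ckpt"

    def checkpoint(self):
        # Save the current file contents alongside the original file.
        shutil.copyfile(self.path, self.shadow)

    def rollback(self):
        # Restore the file to the contents saved at the last checkpoint.
        shutil.copyfile(self.shadow, self.path)


def append_record(path, step):
    # The application's only I/O: append one line per completed step.
    with open(path, "a") as f:
        f.write(f"record {step}\n")


def run(restore_file):
    """Run two steps with a failure after the second, then recover and re-execute.

    If restore_file is False, only the in-memory state is rolled back.
    Returns the final file contents as a list of lines.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "out.log")
        open(path, "w").close()
        file_ckpt = ShadowCopyCheckpoint(path)

        step = 0
        append_record(path, step)
        step += 1

        saved_step = step        # application checkpoint (in-memory state only)
        file_ckpt.checkpoint()   # file checkpoint

        append_record(path, step)
        step += 1
        # Simulated failure here: everything after the checkpoint is lost.

        step = saved_step        # roll back the application state
        if restore_file:
            file_ckpt.rollback() # roll back the file contents as well

        append_record(path, step)  # re-execute the lost step
        step += 1

        with open(path) as f:
            return f.read().splitlines()


if __name__ == "__main__":
    print(run(restore_file=False))  # ['record 0', 'record 1', 'record 1']
    print(run(restore_file=True))   # ['record 0', 'record 1']
```

A full copy is the simplest choice for illustration, but its cost grows with file size. A distributed mechanism for a parallel file system would additionally have to coordinate the file checkpoint across the nodes holding the file data, which this sketch leaves out.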