Checkpointing techniques have been widely studied in the literature as a way to recover from failures in sequential, distributed, and parallel environments. However, most of the checkpointing mechanisms proposed so far focus only on recovering the application data. If the application performs I/O operations on disk files, such schemes may not work correctly, because they do not provide rollback-recovery for the file contents. In this paper, we present a distributed checkpointing mechanism for a parallel file system that can be integrated with any of the previous application checkpointing algorithms. Three different file checkpointing schemes are presented, tested within that mechanism, and discussed in detail. The proposed distributed mechanism was integrated into PIOUS, a public-domain parallel file system developed for the PVM distributed computing environment.
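To make the file-consistency problem concrete, the following minimal Python sketch simulates an application that appends records to a file, takes a checkpoint, fails, and re-executes from the checkpoint. It is only an illustration under stated assumptions: the function names (`run_steps`, `take_checkpoint`, `rollback`, `demo`) are hypothetical, the file is append-only, and recording the file length and truncating on rollback is one simple technique chosen for the example. It is not claimed to be any of the three schemes discussed in the paper, nor the PIOUS interface.

```python
import os
import tempfile

def run_steps(path, state, steps):
    """Append one record per step and advance the in-memory step counter."""
    with open(path, "a") as f:
        for _ in range(steps):
            state["step"] += 1
            f.write(f"record {state['step']}\n")

def take_checkpoint(path, state):
    """Save a copy of the application state plus the current file length."""
    return {"state": dict(state), "file_len": os.path.getsize(path)}

def rollback(path, ckpt, restore_file):
    """Restore application state; optionally roll the file contents back too."""
    if restore_file:
        # Assumption: the file is append-only, so truncating to the
        # checkpointed length undoes every write made after the checkpoint.
        os.truncate(path, ckpt["file_len"])
    return dict(ckpt["state"])

def demo(restore_file):
    """Run: 2 steps, checkpoint, 2 steps, simulated failure, rollback, 2 steps."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out.log")
        open(path, "w").close()          # start from an empty file
        state = {"step": 0}
        run_steps(path, state, 2)
        ckpt = take_checkpoint(path, state)
        run_steps(path, state, 2)        # work done after the checkpoint
        state = rollback(path, ckpt, restore_file)  # recovery after failure
        run_steps(path, state, 2)        # re-execution from the checkpoint
        with open(path) as f:
            return f.read().splitlines()

if __name__ == "__main__":
    print(demo(restore_file=False))  # application state only: duplicated records
    print(demo(restore_file=True))   # file rolled back as well: consistent file
```

When only the application state is restored, records 3 and 4 appear twice in the file, because the writes made before the failure survive the rollback. Restoring the file to its checkpointed length removes the duplicates. Files that are overwritten in place would need a different approach, since truncation cannot undo an in-place update.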