Recovery in distributed systems using optimistic message logging and check-pointing
Journal of Algorithms
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Transparent checkpointing and rollback recovery mechanism for Windows NT applications
ACM SIGOPS Operating Systems Review
Efficient Rollback-Recovery Technique in Distributed Computing Systems
IEEE Transactions on Parallel and Distributed Systems
Fault Tolerance for Clusters of Workstations
Revised Papers from a Workshop on Hardware and Software Architectures for Fault Tolerance
This paper presents ChaRM-NT, a high-availability run-time system that provides checkpoint-based rollback recovery for parallel applications on a cluster of computers (COC) running Windows NT. ChaRM-NT implements an insert-mode, reduced coordinated checkpointing and rollback recovery (CRR) mechanism. With these techniques, ChaRM-NT can recover parallel applications from their checkpoint files after a system failure. In addition, we have implemented a new coordinated checkpointing algorithm that requires only O(n) control messages, where n is the number of participating processes. ChaRM-NT also implements a portable single-process CRR library that is independent of the message passing environment (MPE). It is therefore easy to adapt to different MPEs, and it currently supports PVM and MPI for NT.
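To illustrate why a coordinated checkpoint can be taken with O(n) control messages, the following is a minimal sketch of a generic coordinator-driven, two-phase scheme (request, acknowledge, commit). It is not the ChaRM-NT algorithm, whose details the abstract does not give; the names `Process`, `on_request`, `on_commit`, and `coordinated_checkpoint` are hypothetical, and message delivery is simulated by direct method calls rather than by PVM or MPI.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Process:
    """Hypothetical participant; `state` stands in for the process image."""
    pid: int
    state: int = 0
    tentative: Optional[int] = None   # checkpoint taken but not yet committed
    committed: Optional[int] = None   # last checkpoint usable for recovery

    def on_request(self) -> str:
        # Phase 1: take a tentative checkpoint and acknowledge.
        self.tentative = self.state
        return "ack"

    def on_commit(self) -> None:
        # Phase 2: promote the tentative checkpoint to the committed one.
        self.committed = self.tentative
        self.tentative = None


def coordinated_checkpoint(processes: List[Process]) -> Tuple[int, bool]:
    """Run one checkpoint round; return (control messages used, success)."""
    messages = 0
    acks = 0
    for p in processes:
        messages += 1                 # coordinator -> process: request
        if p.on_request() == "ack":
            acks += 1
        messages += 1                 # process -> coordinator: ack
    if acks != len(processes):
        return messages, False        # abort: previous committed checkpoints stay valid
    for p in processes:
        messages += 1                 # coordinator -> process: commit
        p.on_commit()
    return messages, True


if __name__ == "__main__":
    procs = [Process(pid=i, state=i * 10) for i in range(5)]
    used, ok = coordinated_checkpoint(procs)
    print(used, ok)
```

In this sketch each of the n processes exchanges exactly three control messages with the coordinator, so a round costs 3n messages, which grows linearly in n. Schemes in which every process notifies every other process instead need on the order of n² messages.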