ICPP '98 Proceedings of the 1998 International Conference on Parallel Processing
Synchronous checkpointing is an attractive approach because it simplifies failure recovery by storing a consistent global checkpoint. Efforts have been made to minimize the number of synchronizing messages and the number of checkpoints in such an approach. Taking a checkpoint without blocking the underlying computation is another important feature of the checkpointing process. In this paper, we present a synchronous checkpointing algorithm that forces a minimum number of nodes to take a checkpoint. The underlying computation needs to be blocked only partially, and only in rare cases. The algorithm tolerates the failure of an arbitrary number of nodes while checkpointing is in progress. Consistency of the checkpoint is ensured during the checkpointing process, so no time needs to be spent on it during recovery.
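The abstract does not say how the minimum set of nodes is chosen. The sketch below is not the paper's algorithm. It illustrates one well-known way to bound that set, under an assumption made here: a node must checkpoint only if the initiator's new checkpoint depends on it, directly or transitively, through messages received since the last global checkpoint. The function name `processes_to_checkpoint` and the `received_from` map are hypothetical names introduced for illustration only.

```python
from collections import deque


def processes_to_checkpoint(initiator, received_from):
    """Return the set of nodes forced to checkpoint with `initiator`.

    Assumption (not taken from the paper): received_from[p] is the set of
    nodes whose messages p has received since the last global checkpoint.
    If p records such a receipt in its checkpoint, the sender must also
    checkpoint so that the matching send is recorded; otherwise the
    message would be an orphan and the global checkpoint inconsistent.
    """
    forced = {initiator}
    queue = deque([initiator])
    while queue:
        node = queue.popleft()
        # Every sender this node depends on is pulled into the set.
        for sender in received_from.get(node, ()):
            if sender not in forced:
                forced.add(sender)
                queue.append(sender)
    return forced


# Hypothetical dependency map: A received from B, B from C, D from A.
deps = {"A": {"B"}, "B": {"C"}, "D": {"A"}}
print(sorted(processes_to_checkpoint("A", deps)))
```

In this example, node D is not forced when A initiates: D has only received from A, so A's checkpoint does not depend on D. Nodes outside the returned set keep their previous checkpoint and need not be interrupted.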