This paper presents SFT, a consistent checkpointing algorithm with a shorter freezing time. SFT provides fault tolerance in distributed systems. Its features include a shorter freezing time, lower overhead, and simple rollback. To reduce checkpointing time, a special control message (Munblock) ensures that a process can respond quickly to a checkpoint event at any given time. In addition, a main-memory algorithm is used to improve the concurrency of checkpointing. With SFT, the freezing time caused by checkpointing is less than 0.03 s, and the number of control messages is only O(n).
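The abstract does not describe the main-memory algorithm in detail. One common way to shorten freezing time, sketched below purely as an illustration, is to block the process only while its state is copied into a memory buffer and to write that buffer to stable storage in the background. This is not the SFT implementation: the class `MainMemoryCheckpointer`, the function `restore`, the file naming, and the use of `pickle` are all assumptions made for this example.

```python
import copy
import os
import pickle
import tempfile
import threading


class MainMemoryCheckpointer:
    """Hypothetical helper: freeze only for an in-memory copy, write to disk later."""

    def __init__(self, directory):
        self.directory = directory
        self._writer = None

    def checkpoint(self, state, seq):
        # Frozen phase: the caller is blocked only while its state is
        # copied into a main-memory buffer.
        snapshot = copy.deepcopy(state)

        # Unfrozen phase: a background thread saves the buffer to stable
        # storage while the caller resumes computation.
        path = os.path.join(self.directory, f"ckpt_{seq}.pkl")
        self._writer = threading.Thread(target=self._write, args=(snapshot, path))
        self._writer.start()
        return path

    @staticmethod
    def _write(snapshot, path):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename into place so a partially written checkpoint is never visible.
        os.replace(tmp, path)

    def wait(self):
        # Block until the most recent background write has finished.
        if self._writer is not None:
            self._writer.join()


def restore(path):
    """Load a previously saved checkpoint (rollback)."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    cp = MainMemoryCheckpointer(workdir)
    state = {"counter": 1, "log": [1, 2]}
    saved = cp.checkpoint(state, seq=0)
    state["counter"] = 99  # computation continues while the write is in flight
    cp.wait()
    print(restore(saved))
```

Because the snapshot is a deep copy, changes the process makes after `checkpoint` returns do not leak into the saved checkpoint. The sketch covers a single process only; the coordination among processes that makes a checkpoint consistent, including the Munblock control message, is not modeled here.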