Checkpointing and message logging are among the popular, general-purpose methods for providing fault tolerance in distributed systems. Several variations of their basic schemes have been reported in the literature. Most coordinated checkpointing algorithms do not address the treatment of lost messages, and schemes that improve several or all performance factors are rare. We address these issues with a new, efficient coordinated checkpointing protocol combined with limited sender-based pessimistic message logging. The significant contribution of our scheme is that it never creates lost messages. The term "limited message logging" means that ours is a periodic checkpointing strategy in which checkpointing and message logging take place only within a specified interval, called the critical interval (C.I.). Hence it minimizes checkpoint overhead, rollback distance, message logging, and even recovery overheads. Output commit latency is also reduced considerably. Further, processes need not be blocked while messages are being logged. Performance measurements obtained from our simulations indicate that the proposed strategy outperforms the existing standard techniques: independent checkpointing, pure sender-based pessimistic message logging, and optimistic message logging. Another merit of our protocol is that it is hardware independent and can therefore be implemented in multicomputer systems irrespective of architecture, interconnection, and routing strategy.
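The idea of logging messages at the sender only during a critical interval can be illustrated with a minimal Python sketch. This is not the protocol from the paper: the class name `Sender`, the parameters `period` and `ci_length`, and the assumption that each period opens with the critical interval are all hypothetical, chosen only to make the notion of "limited" sender-based logging concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    """Hypothetical sender that logs outgoing messages only inside a critical interval."""
    period: float      # assumed time between the starts of successive critical intervals
    ci_length: float   # assumed length of each critical interval
    log: list = field(default_factory=list)

    def in_critical_interval(self, now: float) -> bool:
        # Assumption: every period begins with a window of ci_length time units.
        return (now % self.period) < self.ci_length

    def send(self, now: float, dest: str, payload: str) -> tuple:
        message = (now, dest, payload)
        if self.in_critical_interval(now):
            # Sender-based logging: keep a copy on the sender's side,
            # so the message can be replayed if the receiver rolls back.
            self.log.append(message)
        return message

    def replay_for(self, dest: str) -> list:
        # Messages that could be resent to a recovering destination.
        return [m for m in self.log if m[1] == dest]


sender = Sender(period=10.0, ci_length=2.0)
sender.send(1.0, "p2", "a")    # inside the critical interval: logged
sender.send(5.0, "p2", "b")    # outside the critical interval: not logged
sender.send(11.5, "p3", "c")   # inside the next critical interval: logged
```

In this sketch the log grows only during the critical interval, which is the sense in which logging overhead is bounded; how the actual protocol places and coordinates that interval is not specified here.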