Design, Analysis and Performance Evaluation of a New Algorithm for Developing a Fault Tolerant Distributed System

Authors:
Umasankar Malladi
Affiliations:
gmail.com
Venue:
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Year:
2006

Abstract

Checkpointing and message logging are among the most popular general-purpose methods for providing fault tolerance in distributed systems. Several variations of their basic schemes have been reported in the literature. The majority of coordinated checkpointing algorithms do not address the treatment of lost messages, and schemes that improve several or all performance factors are rare. We address these issues by developing a new and efficient coordinated checkpointing protocol combined with limited sender-based pessimistic message logging. The significant contribution of our scheme is that it never creates lost messages. The term "limited message logging" means that ours is a periodic checkpointing strategy in which checkpointing and message logging take place only within a specified interval, called the critical interval (C.I.). Hence it minimizes checkpoint overhead, rollback distance, message logging, and even recovery overhead. Output commit latency is also reduced considerably. Further, processes need not be blocked while messages are being logged. Performance measurements obtained from our simulations indicate that the proposed strategy outperforms the existing standard techniques: independent checkpointing, pure sender-based pessimistic message logging, and optimistic message logging. Another merit of our protocol is that it is hardware independent and can therefore be implemented in multicomputer systems irrespective of architecture, interconnection, and routing strategy.
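The abstract does not spell out the protocol's steps, so the following is only a minimal sketch of the general idea of sender-based logging restricted to a critical interval, not the authors' algorithm. It assumes a toy model in which a process checkpoints when the interval begins and logs only the messages it sends while the interval is open. All names (Process, begin_critical_interval, replay_to, demo) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int
    receiver: int
    seq: int
    payload: str


@dataclass
class Process:
    pid: int
    in_critical_interval: bool = False
    state: list = field(default_factory=list)       # toy application state: delivered payloads
    checkpoint: list = field(default_factory=list)  # last saved copy of state
    send_log: list = field(default_factory=list)    # messages sent inside the interval
    next_seq: int = 0

    def begin_critical_interval(self):
        # Assumption: the checkpoint is taken when the interval opens.
        self.in_critical_interval = True
        self.send_log.clear()
        self.checkpoint = list(self.state)

    def end_critical_interval(self):
        self.in_critical_interval = False

    def send(self, receiver, payload):
        msg = Message(self.pid, receiver, self.next_seq, payload)
        self.next_seq += 1
        if self.in_critical_interval:
            # Sender-based logging: the sender keeps its own copy.
            self.send_log.append(msg)
        return msg

    def deliver(self, msg):
        self.state.append(msg.payload)

    def rollback(self):
        # Restore the state saved at the last checkpoint.
        self.state = list(self.checkpoint)

    def replay_to(self, receiver_pid):
        # Logged messages that a recovering receiver may have missed.
        return [m for m in self.send_log if m.receiver == receiver_pid]


def demo():
    p0, p1 = Process(0), Process(1)
    p1.deliver(p0.send(1, "a"))   # outside the interval: delivered, not logged
    p0.begin_critical_interval()
    p1.begin_critical_interval()  # p1 checkpoints state ["a"]
    p0.send(1, "b")               # logged; assume p1 fails before delivery
    p1.rollback()                 # p1 recovers from its checkpoint
    for m in p0.replay_to(1):     # sender replays from its log
        p1.deliver(m)
    p0.end_critical_interval()
    p1.end_critical_interval()
    return p1.state, [m.payload for m in p0.send_log]
```

In this toy run, message "b" is in transit when process 1 fails, and it is recovered from the sender's log rather than lost, while message "a", sent outside the interval, incurs no logging cost.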