Design, Analysis and Performance Evaluation of a New Algorithm for Developing a Fault Tolerant Distributed System

  • Authors:
  • Umasankar Malladi

  • Affiliations:
  • gmail.com

  • Venue:
  • ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
  • Year:
  • 2006

Abstract

Checkpointing and message logging are among the most popular general-purpose methods for providing fault tolerance in distributed systems. Several variations of their basic schemes have been reported in the literature. Most coordinated checkpointing algorithms do not address the treatment of lost messages, and schemes that try to improve several or all performance factors are rare. We address these issues by developing a new and efficient coordinated checkpointing protocol combined with limited sender-based pessimistic message logging. The significant contribution of our scheme is that it never creates lost messages. The term "limited message logging" means that ours is a periodic checkpointing strategy in which checkpointing and message logging take place only within a specified interval, called the critical interval (C.I). Hence it minimizes checkpoint overhead, rollback distance, message logging overhead, and even recovery overhead. Output commit latency is also reduced considerably. Further, processes need not be blocked while messages are being logged. Performance measurements obtained from our simulations indicate that the proposed strategy outperforms the existing standard techniques: independent checkpointing, pure sender-based pessimistic message logging, and optimistic message logging. Another merit of our protocol is that it is hardware independent and can therefore be implemented in multi-computer systems irrespective of architecture, interconnection, and routing strategy.
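
The idea of logging only inside a critical interval can be illustrated with a minimal sketch. It is not the paper's protocol: the abstract does not say where the critical interval sits within a checkpoint period or how long it is. The class name `Process`, the parameters `period` and `ci_length`, and the choice to place the interval at the end of each period are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy sender that logs outgoing messages only inside a critical interval."""
    pid: int
    period: float = 10.0     # assumed time between periodic checkpoint rounds
    ci_length: float = 2.0   # assumed length of the critical interval (C.I)
    log: list = field(default_factory=list)

    def in_critical_interval(self, now: float) -> bool:
        # Assumption: the critical interval is the last `ci_length`
        # time units of every checkpoint period.
        return (now % self.period) >= self.period - self.ci_length

    def send(self, dest: int, payload: str, now: float) -> tuple:
        msg = (self.pid, dest, payload, now)
        if self.in_critical_interval(now):
            # Sender-based logging: the sender keeps its own copy of the
            # message so it could be replayed after a failure.
            self.log.append(msg)
        return msg


if __name__ == "__main__":
    p = Process(pid=0)
    for t, text in [(3.0, "a"), (8.5, "b"), (19.0, "c"), (20.0, "d")]:
        p.send(dest=1, payload=text, now=t)
    # Only the messages sent inside a critical interval were logged.
    print([m[2] for m in p.log])
```

Messages sent outside the interval incur no logging cost in this sketch, which is the sense in which the logging is "limited"; how the actual protocol coordinates checkpoints and guarantees that no message is lost is not specified in the abstract and is not modeled here.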