An efficient protocol for checkpoint-based failure recovery in distributed systems

  • Authors:
  • D. Goswami; S. Sahu

  • Affiliations:
  • Indian Institute of Technology Guwahati, North Guwahati, India; Indian Institute of Technology Guwahati, North Guwahati, India

  • Venue:
  • ICDCIT'04 Proceedings of the First International Conference on Distributed Computing and Internet Technology
  • Year:
  • 2004

Abstract

Synchronous checkpointing is an attractive approach as it simplifies the process of failure recovery by storing a consistent global checkpoint. Efforts have been made to minimize the number of synchronizing messages and the number of checkpoints in such an approach. Taking the checkpoint without blocking the underlying computation is another important feature of the checkpointing process. In this paper, we present a synchronous checkpointing algorithm which forces a minimum number of nodes to take a checkpoint. Underlying computation needs to be blocked partially and only in rare cases. The algorithm tolerates the failure of an arbitrary number of nodes during the progress. Consistency of the checkpoint is ensured during the checkpointing process and hence no time needs to be spent during recovery.