Quantifying rollback propagation in distributed checkpointing

  • Authors:
  • Adnan Agbaria;Hagit Attiya;Roy Friedman;Roman Vitenberg

  • Affiliations:
  • Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel (all authors)

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2004

Abstract

This paper proposes a new classification of executions with checkpoints based on the amount of rollback during recovery. Specifically, an execution is k-rollback if k is the maximal number of checkpoints that have to be rolled back. It is shown that coordinated checkpointing, SZPF, and ZPF are 1-rollback, while ZCF is (n - 1)-rollback, where n is the number of participants in an execution.

A new class of executions, called d-bounded cycles (in short, d-BC), is introduced and is shown to be ((n - 1) · d)-rollback (ZCF is a special case of d-BC for d = 1).

Finally, a protocol is presented whose executions are d-bounded cycles. A nice property of this protocol is that it does not impose any control information overhead on application messages, and it sends only a few control messages of its own. Moreover, the protocol maintains information that enables very efficient discovery of a recent recovery line that existed shortly before the failure.
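
The rollback bounds stated in the abstract can be collected into a small lookup, which makes it easy to see how the d-BC bound collapses to the ZCF bound when d = 1. The following is only an illustrative sketch: the function name `rollback_bound` and the string labels for the classes are hypothetical and do not come from the paper, and the function merely restates the bounds quoted above rather than deriving them.

```python
def rollback_bound(execution_class: str, n: int, d: int = 1) -> int:
    """Return k such that the abstract states the class is k-rollback.

    execution_class: hypothetical label, one of
        "coordinated", "SZPF", "ZPF", "ZCF", "d-BC"
    n: number of participants in the execution
    d: cycle bound, used only for the "d-BC" class
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Coordinated checkpointing, SZPF, and ZPF are stated to be 1-rollback.
    if execution_class in {"coordinated", "SZPF", "ZPF"}:
        return 1

    # ZCF is stated to be (n - 1)-rollback.
    if execution_class == "ZCF":
        return n - 1

    # d-bounded cycles are stated to be ((n - 1) * d)-rollback.
    if execution_class == "d-BC":
        if d < 1:
            raise ValueError("d must be at least 1")
        return (n - 1) * d

    raise ValueError(f"unknown execution class: {execution_class!r}")


if __name__ == "__main__":
    n = 5
    for cls in ("coordinated", "SZPF", "ZPF", "ZCF"):
        print(cls, rollback_bound(cls, n))
    print("d-BC (d=3)", rollback_bound("d-BC", n, d=3))
```

With d = 1 the d-BC expression reduces to n - 1, matching the abstract's remark that ZCF is the d = 1 special case of d-BC.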