Checkpoint and Rollback in Asynchronous Distributed Systems

  • Authors:
  • H. Higaki; K. Shima; T. Tachikawa; M. Takizawa

  • Venue:
  • Proceedings of INFOCOM '97, Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies: Driving the Information Revolution
  • Year:
  • 1997

Abstract

This paper proposes a novel algorithm for taking checkpoints and rolling back processes for recovery in asynchronous distributed systems. The algorithm has the following properties: (1) multiple processes can initiate checkpointing simultaneously; (2) no additional messages are transmitted to take checkpoints; (3) the set of local checkpoints taken by the processes denotes a consistent global state; (4) multiple processes can initiate rollback recovery simultaneously; (5) only the minimum number of processes is rolled back; and (6) each process is rolled back asynchronously. Rolling back the processes requires O(l) messages, where l is the number of channels. The algorithm therefore keeps the system highly available.
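
Property (3) relies on the standard notion of a consistent global state: no local checkpoint records the receipt of a message that no checkpoint records as sent (no "orphan" message). The sketch below illustrates that definition only; it is not the algorithm proposed in the paper. The names `Checkpoint`, `orphan_messages`, and `is_consistent` are hypothetical, and message ids are assumed to be globally unique.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical local checkpoint: ids of messages sent/received before it."""
    process: str
    sent: frozenset = field(default_factory=frozenset)
    received: frozenset = field(default_factory=frozenset)


def orphan_messages(checkpoints):
    """Return ids recorded as received by some checkpoint but sent by none."""
    all_sent = set()
    all_received = set()
    for cp in checkpoints:
        all_sent |= cp.sent
        all_received |= cp.received
    return all_received - all_sent


def is_consistent(checkpoints):
    """A set of local checkpoints is consistent if it has no orphan messages."""
    return not orphan_messages(checkpoints)


# p1 checkpoints after sending m1, p2 after receiving it: consistent.
p1_late = Checkpoint("p1", sent=frozenset({"m1"}))
p2 = Checkpoint("p2", received=frozenset({"m1"}))

# p1 checkpoints before sending m1: m1 becomes an orphan at p2.
p1_early = Checkpoint("p1")
```

With these assumed inputs, `is_consistent([p1_late, p2])` holds, while `[p1_early, p2]` is inconsistent because `m1` is received without being sent. In the second case a recovery scheme would also have to roll back `p2` to a checkpoint taken before it received `m1`.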