Complete Process Recovery: Using Vector Time to Handle Multiple Failures in Distributed Systems

  • Authors: Golden G. Richard III; Mukesh Singhal

  • Venue: IEEE Parallel & Distributed Technology: Systems & Technology
  • Year: 1997

Abstract

As the number of processors in a distributed system and the running time of applications increase, the likelihood of processor failure increases. The failure of even a single processor can mandate restarting an entire application from scratch. With the appropriate recovery mechanisms, distributed applications can survive multiple failures and avoid complete restarts. The authors present a set of distributed recovery techniques called CPR (Complete Process Recovery) that use vector time to handle failures and that address both consistent-state restoration and the associated message-handling issues. The latter is important because some recovery techniques delegate the handling of lost or duplicate messages to the message-transport mechanism. Such delegation requires the ability to checkpoint the network layer (a serious restriction), because the network is generally unaware of the anomalous messages induced by process failure and recovery. CPR requires nonfailed processes to roll back at most once in response to a single failure, and it has low message complexity. Processes that must roll back after a failure do so concurrently, which substantially decreases recovery delay.
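
The abstract relies on vector time without defining it. As a rough illustration only, the following Python sketch shows generic vector clocks and one common way they can reveal that a surviving process depends on work lost by a failed process. It is not the CPR algorithm from the paper; the names VectorClock, happened_before, concurrent, and depends_on_lost_state are hypothetical and chosen for this example.

```python
from typing import List


class VectorClock:
    """Generic vector time for one process among n (illustrative, not CPR)."""

    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.time: List[int] = [0] * n

    def tick(self) -> List[int]:
        """Advance the local component for an internal or send event."""
        self.time[self.pid] += 1
        return list(self.time)

    def send(self) -> List[int]:
        """Timestamp an outgoing message; the stamp travels with it."""
        return self.tick()

    def receive(self, stamp: List[int]) -> List[int]:
        """Merge the sender's stamp (component-wise max), then tick."""
        self.time = [max(a, b) for a, b in zip(self.time, stamp)]
        self.time[self.pid] += 1
        return list(self.time)


def happened_before(a: List[int], b: List[int]) -> bool:
    """True if the event stamped a causally precedes the event stamped b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a: List[int], b: List[int]) -> bool:
    """True if neither event causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


def depends_on_lost_state(state: List[int], failed_pid: int,
                          restored: List[int]) -> bool:
    """Assumed rule for this sketch: a state depends on lost work if it has
    seen more events of the failed process than the restored checkpoint."""
    return state[failed_pid] > restored[failed_pid]


# Process 0 checkpoints, sends a message, then fails and restores the checkpoint.
p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
checkpoint = p0.tick()          # state saved by process 0
message = p0.send()             # sent after the checkpoint, so lost on failure
p1_state = p1.receive(message)  # process 1 now depends on the lost send
needs_rollback = depends_on_lost_state(p1_state, 0, checkpoint)
```

In this run, process 1 received a message that process 0 sent after its last checkpoint, so the check flags process 1 as needing to roll back. A state that never heard from process 0 after the checkpoint would not be flagged.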