Another Two-Level Failure Recovery Scheme

  • Authors:
  • Nitin H. Vaidya

  • Affiliations:
  • -

  • Venue:
  • Another Two-Level Failure Recovery Scheme
  • Year:
  • 1994

Abstract

This report deals with the design and evaluation of a "two-level" failure recovery scheme for distributed systems. In our previous work [30, 32], we motivated a "two-level" recovery approach that tolerates the more probable failures with low overhead and the less probable failures with possibly higher overhead. The two-level approach can achieve a smaller overhead than traditional recovery schemes. In this report, we present and evaluate a "two-level" recovery scheme suited to a network of workstations, each with a local disk. The scheme tolerates transient processor failures with low overhead, while other failures incur a larger overhead. The report analyzes the average (expected) task completion time under the proposed scheme, which has been implemented on a workstation cluster. Our analysis indicates that the proposed two-level recovery scheme can achieve better performance than existing "one-level" recovery schemes. The report also evaluates the impact of checkpoint latency on the performance of the recovery scheme. To our knowledge, no analysis of the performance impact of checkpoint latency has been carried out previously. Experimental measurements of checkpoint latency for four applications are presented.
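
The two-level idea summarized in the abstract can be illustrated with a small Monte Carlo sketch. This is not the report's analysis or implementation. It is a toy model built on assumptions that do not come from the source: failures arrive as a Poisson process, each failure is transient with a fixed probability, a transient failure rolls back to the latest local-disk checkpoint, and any other failure rolls back to the latest stable-storage checkpoint. Every function and parameter name (`simulate_completion_time`, `stable_every`, `p_transient`, and so on) is hypothetical, and checkpoint latency is not modeled.

```python
import math
import random


def simulate_completion_time(task_length, interval, local_cost, stable_cost,
                             stable_every, failure_rate, p_transient, rng):
    """Simulate one run of a toy two-level scheme and return the elapsed time.

    Assumed model (not from the report): every `interval` units of work end
    with a checkpoint. Every `stable_every`-th checkpoint goes to stable
    storage (cost `stable_cost`); the others go to local disk (`local_cost`).
    Setting stable_every=1 gives a "one-level" scheme.
    """
    def next_gap():
        # Exponential inter-failure times; a rate of 0 means no failures.
        return rng.expovariate(failure_rate) if failure_rate > 0 else math.inf

    elapsed = 0.0
    done = 0.0                 # work saved by the latest checkpoint of any kind
    stable_done = 0.0          # work saved by the latest stable checkpoint
    since_stable = 0           # checkpoints taken since the last stable one
    next_failure = next_gap()

    while done < task_length:
        work = min(interval, task_length - done)
        to_stable = since_stable + 1 >= stable_every
        cost = work + (stable_cost if to_stable else local_cost)

        if elapsed + cost <= next_failure:
            # Segment and its checkpoint complete without a failure.
            elapsed += cost
            done += work
            if to_stable:
                stable_done = done
                since_stable = 0
            else:
                since_stable += 1
        else:
            # A failure interrupts the segment; the partial work is lost.
            elapsed = next_failure
            next_failure = elapsed + next_gap()
            if rng.random() >= p_transient:
                # Non-transient failure: local checkpoints are assumed lost.
                done = stable_done
                since_stable = 0
            # Transient failure: restart from the latest checkpoint (`done`).
    return elapsed


def mean_completion_time(runs, seed=0, **params):
    """Average the simulated completion time over `runs` independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_completion_time(rng=rng, **params) for _ in range(runs))
    return total / runs
```

Calling `mean_completion_time` once with `stable_every=1` and once with a larger value, while keeping the other parameters fixed, gives a rough comparison of a one-level and a two-level configuration under these assumptions. The outcome depends entirely on the chosen costs and failure parameters, so it says nothing about the results reported by the author.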