A case for two-level distributed recovery schemes

  • Authors:
  • Nitin H. Vaidya

  • Affiliations:
  • Department of Computer Science, Texas A&M University, College Station, TX

  • Venue:
  • Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
  • Year:
  • 1995

Quantified Score

Hi-index 0.01

Visualization

Abstract

Most distributed and multiprocessor recovery schemes proposed in the literature are designed to tolerate arbitrary number of failures. In this paper, we demonstrate that, it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may be tolerated with a higher overhead. By minimizing the overhead for the more frequently occurring failure scenarios, our approach is expected to achieve lower performance overhead (on average) as compared to existing recovery schemes.To demonstrate the advantages of two-level recovery, we evaluate the performance of a recovery scheme that takes two different types of checkpoints, namely, 1-checkpoints and N-checkpoints. A single failure can be tolerated by rolling the system back to a 1-checkpoint, while multiple failure recovery is possible by rolling back to an N-checkpoint. For such a system, we demonstrate that to minimize the average overhead, it is often necessary to take both 1-checkpoints and N-checkpoints.While the conclusions of this paper are intuitive, the work on design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance.