Performance analysis of checkpointing strategies

  • Authors:
  • Asser N. Tantawi;Manfred Ruschitzka

  • Affiliations:
  • IBM Thomas J. Watson Research Center, Yorktown Heights, New York;Department of Electrical and Computer Engineering, University of California, Davis, California

  • Venue:
  • SIGMETRICS '83 Proceedings of the 1983 ACM SIGMETRICS conference on Measurement and modeling of computer systems
  • Year:
  • 1983

Quantified Score

Hi-index 0.03

Visualization

Abstract

A widely used error recovery technique in database systems is the rollback and recovery technique. This technique saves periodically the state of the system and records all activities on a reliable log tape. The operation of saving the system state is called checkpointing. The elapsed time between two consecutive checkpointing operations is called checkpointing interval. When the system fails, the recovery process uses the log tape and the state saved at the most recent checkpoint to bring the system to the correct state that preceded the failure. This process is called error recovery and consists of loading the most recent state and then reprocessing all the activities, stored on the log tape, that took place since the most recent checkpoint and prior to failure. Former models of rollback and recovery assumed Poisson failures and fixed (or exponential) checkpointing intervals. Extending these models, we consider general failure distributions. We also allow checkpointing intervals to depend on the reprocessing time (the time elapsed between the most recent checkpoint prior to failure and the time of failure) and the failure distribution. Furthermore, failures may occur during the checkpointing and error recovery. Our general model unifies a variety of models that have previously been investigated. We denote by Fi; and t(Fi), i &equil; 1, 2, ..., the ith failure that occurs during normal processing (not during error recovery) and the time of its occurrence, respectively. We refer to the time period Li &equil; t(Fi+1) − t(Fi), i &equil; 1, 2, ..., as the ith cycle whose length is Li. It consists of two portions: the total error recovery time and the normal processing time. The reprocessing time associated with failure Fi is denoted by Yi−1. Since the variables of the ith cycle depend at most on one variable of the (i − 1)st cycle, namely Yi−1, the stochastic process of the reprocessing time {Yi; i≥0} is a Markov process. We obtain the transition probability density function and the stationary distribution of this process. The performance of the system is measured by the availability, the fraction of time the system is not checkpointing or recovering from errors. In equilibrium, the system availability is expressed as the ratio of the mean production time (normal processing time excluding checkpointing time) during a cycle and the mean length of the cycle. We obtain a general expression for the system availability in our general model. The checkpointing strategy is characterized by the sequence of checkpointing intervals. For the well-known equidistant checkpointing strategy, in which the checkpointing intervals are constant, we find that the resulting system availability depends only on the mean of the failure distribution. We define a checkpointing strategy as failure-dependent if the sequence of checkpointing intervals depends on the failure distribution. Checkpointing strategies that result in a checkpointing operation immediately after error recovery are called reprocessing-independent strategies. We then introduce a novel checkpointing strategy, the equicost strategy, which is failure-dependent and reprocessing-independent. This strategy suggests that a checkpointing operation is to be performed whenever the mean reprocessing cost equals the mean checkpointing cost. Interestingly, the equicost strategy leads to fixed checkpointing intervals for Poisson failures. We compare the maximum system availability resulting from the equidistant and the equicost checkpointing strategies under Weibull distributions which are good approximations of actual failure distributions. Computational results based on Weibull failure distributions (both increasing and decreasing failure rates) show that the equicost strategy achieves higher system availability than the equidistant strategy which is known to be optimal under Poisson failures.