Comparing checkpoint and rollback recovery schemes in a cluster system

  • Authors:
  • Noriaki Bessho; Tadashi Dohi

  • Affiliations:
  • Department of Information Engineering, Graduate School of Engineering, Hiroshima University, Higashi-Hiroshima, Japan;Department of Information Engineering, Graduate School of Engineering, Hiroshima University, Higashi-Hiroshima, Japan

  • Venue:
  • ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
  • Year:
  • 2012

Abstract

Cluster systems play a central role in realizing high-performance computing at relatively low cost, but they also need fault-tolerance features for practical use. In this paper we develop stochastic models to evaluate the expected total recovery overhead of a cluster computing system under three well-known checkpoint and rollback recovery schemes: checkpoint mirroring, central file server checkpointing, and skewed checkpointing, where the fault latency time after a system failure is given by a random variable. Because multi-node failures as well as single-node failures may occur in a cluster system, it is generally not easy to obtain the expected total recovery overhead in closed form. Based on a simple failure model, we derive it by enumerating all possible combinations of probabilistic events caused by multi-node failures. We then compare the expected total recovery overhead across the three schemes and quantitatively evaluate their effectiveness.
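
The idea of enumerating every combination of node failures can be illustrated with a minimal sketch. It is not the model from the paper: it assumes that each node fails independently with the same probability, and the function names, the cost values, and the neighbour-mirroring rule are all hypothetical choices made for illustration only.

```python
from itertools import combinations


def expected_overhead(n_nodes, p_fail, overhead_of):
    """Expected overhead, summed over every possible set of failed nodes.

    Assumption (not from the paper): nodes fail independently, each with
    probability p_fail. overhead_of maps a frozenset of failed node ids
    to a recovery cost.
    """
    total = 0.0
    for k in range(n_nodes + 1):
        # Probability of one specific set of exactly k failed nodes.
        prob = p_fail ** k * (1.0 - p_fail) ** (n_nodes - k)
        for failed in combinations(range(n_nodes), k):
            total += prob * overhead_of(frozenset(failed))
    return total


def mirroring_overhead(failed, n_nodes=4, local_cost=1.0, restart_cost=10.0):
    """Hypothetical cost rule for a mirroring-style scheme.

    Assumption: node i keeps a copy of its checkpoint on node (i + 1) % n_nodes.
    If a failed node's mirror has also failed, the checkpoint is lost and the
    larger restart_cost is charged; otherwise local_cost is charged.
    """
    if not failed:
        return 0.0
    lost = any((i + 1) % n_nodes in failed for i in failed)
    return restart_cost if lost else local_cost


if __name__ == "__main__":
    print(expected_overhead(4, 0.05, mirroring_overhead))
```

Swapping in a different `overhead_of` function is one way to compare schemes under the same failure assumptions. The enumeration visits 2^n failure sets, so it is only practical for small node counts.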