Evaluation of fault-tolerant policies using simulation

  • Authors:
  • Anand Tikotekar; Geoffroy Vallee; Thomas Naughton; Stephen L. Scott; Chokchai Leangsuksun

  • Affiliations:
  • Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA; Louisiana Tech University, Ruston, LA 71272, USA

  • Venue:
  • CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
  • Year:
  • 2007

Abstract

Various fault-tolerance (FT) mechanisms are used today to reduce the impact of failures on application execution. For system failures, the standard FT mechanisms are checkpoint/restart (reactive FT) and migration (proactive FT). However, each of these mechanisms adds overhead to application execution, and this overhead becomes critical on large-scale systems, where previous studies have shown that applications may spend more time checkpointing state than performing useful work.