Analytical study of migration-enhanced fault tolerance for long-running applications in IFR systems

  • Authors:
  • Yawei Li; Zhiling Lan

  • Affiliations:
  • Department of Computer Science, Illinois Institute of Technology, Chicago, IL, USA; Department of Computer Science, Illinois Institute of Technology, Chicago, IL, USA

  • Venue:
  • International Journal of Parallel, Emergent and Distributed Systems
  • Year:
  • 2008

Abstract

Computer systems with an increasing failure rate (IFR) are common in practice. For such systems, the literature indicates that aperiodic checkpointing can outperform periodic checkpointing because it adapts to the failure process. For long-running applications, however, aperiodic checkpointing incurs substantial operational overhead from increasingly frequent checkpointing operations as the application proceeds. To address this problem, we propose combining just-in-time process migration with aperiodic checkpointing for applications running in an IFR system, with the goal of reducing application execution time in the presence of failures. In particular, we present an analytical study of this migration-enhanced fault tolerance scheme (denoted migCP): we derive the application completion time under migCP and determine the optimal migration locations. We demonstrate, through analytical modelling and empirical studies, that migCP outperforms aperiodic checkpointing under a variety of system parameters.
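
The abstract does not state the failure model or the checkpoint placement rule, so the following is only an illustrative sketch of why aperiodic checkpointing grows costlier over time in an IFR system. It assumes a Weibull failure process with shape parameter greater than 1 (a standard textbook example of IFR) and a Young-style first-order interval, sqrt(2C / h(t)), re-evaluated at each checkpoint with the current hazard rate h(t). Neither assumption is confirmed as the paper's formulation, and the names weibull_hazard, checkpoint_times, and max_interval are hypothetical.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) of a Weibull(shape, scale) process.

    Assumption: shape >= 1, so h(t) is constant (shape == 1) or increasing.
    """
    if shape < 1.0:
        raise ValueError("this sketch only covers shape >= 1 (non-decreasing hazard)")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def checkpoint_times(horizon, shape, scale, ckpt_cost, max_interval):
    """Hypothetical aperiodic schedule (not the paper's algorithm).

    At each step the next interval is the Young-style value
    sqrt(2 * ckpt_cost / h(t)), capped at max_interval because h(0) == 0
    when shape > 1. All quantities share one arbitrary time unit.
    """
    times = []
    t = 0.0
    while True:
        h = weibull_hazard(t, shape, scale)
        if h == 0.0:
            interval = max_interval
        else:
            interval = min(max_interval, math.sqrt(2.0 * ckpt_cost / h))
        t += interval
        if t >= horizon:
            break
        times.append(t)
    return times


# Illustrative parameters only; they are not taken from the paper.
ifr_schedule = checkpoint_times(horizon=1000.0, shape=2.0, scale=500.0,
                                ckpt_cost=0.5, max_interval=100.0)
ifr_gaps = [b - a for a, b in zip(ifr_schedule, ifr_schedule[1:])]
```

Under these assumptions the gaps in ifr_gaps shrink as t grows, so a long-running job pays for checkpoints more and more often. That is the overhead the abstract proposes to relieve by migrating the process. With shape set to 1.0 the hazard is constant and the same routine reduces to a periodic schedule.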