The Average Availability of Parallel Checkpointing Systems and Its Importance in Selecting Runtime Parameters

  • Authors:
  • James S. Plank; Michael G. Thomason

  • Venue:
  • FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
  • Year:
  • 1999

Abstract

Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we briefly present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.