Using checkpointing to recover from poor multi-site parallel job scheduling decisions

  • Authors:
  • William M. Jones

  • Affiliations:
  • United States Naval Academy, Annapolis, MD

  • Venue:
  • Proceedings of the 5th international workshop on Middleware for grid computing: held at the ACM/IFIP/USENIX 8th International Middleware Conference
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recent research in multi-site parallel job scheduling leverages user-provided estimates of job communication characteristics to effectively partition the job across multiple clusters. Previous research addressed the impact of inaccuracies in these estimates on overall system performance and found that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are many instances where these errors result in poor scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application runtime performance and turnaround time. In this paper, we explore the use of job checkpointing to selectively stop offending jobs in order to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions and the extent to which checkpointing improves overall performance. We demonstrate that checkpointing is beneficial even when the overhead of doing so is costly.