Using queue structures to improve job reliability

  • Authors:
  • Thomas J. Hacker;Zdzislaw Meglicki

  • Affiliations:
  • Purdue University;Indiana University

  • Venue:
  • Proceedings of the 16th international symposium on High performance distributed computing
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Many high performance computing systems today exploit the availability and remarkable performance characteristics of stand alone server systems and the impressive price / performance ratio of commodity components. Small scale HPC systems, in the range from 16 to 64 processors, have enjoyed significant popularity and are an indispensable tool for the research community. Scaling up to hundreds and thousands of processors, however, has exposed operational issues, which include system availability and reliability. In this paper, we explore the impact of individual component reliability rates on the overall reliability of an HPC system. We derive a mathematical model for determining the failure rate of the system, the probability of failure of a job running on a subset of the system, and show how to design a reasonable queue structure to provide a reliable system over abroad job mix. We also explore the impact of reliability and queue structure on checkpoint intervals and recovery. Our results demonstrate that it is possible to design a reliable high performance computing system with very good operational reliability characteristics from a collection of moderately reliable components.