Predicting bounds on queuing delay for batch-scheduled parallel machines

  • Authors:
  • John Brevik;Daniel Nurmi;Rich Wolski

  • Affiliations:
  • University of California, Santa Barbara, CA;University of California, Santa Barbara, CA;University of California, Santa Barbara, CA

  • Venue:
  • Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
  • Year:
  • 2006

Abstract

Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have accounts at multiple sites and have the option of choosing at which site or sites to submit a parallel job. In such a situation, the amount of time a user's job will wait in any one batch queue can significantly impact the overall time a user waits from job submission to job completion. In this work, we explore a new method for providing end-users with predictions for the bounds on the queuing delay individual jobs will experience. We evaluate this method using batch scheduler logs for distributed-memory parallel machines that cover a 9-year period at 7 large HPC centers. Our results show that it is possible to predict delay bounds reliably for jobs in different queues, and for jobs requesting different ranges of processor counts. Using this information, scientific application developers can intelligently decide where to submit their parallel codes in order to minimize overall turnaround time.