QBETS: queue bounds estimation from time series

  • Authors:
  • Daniel Nurmi; John Brevik; Rich Wolski

  • Affiliations:
  • Computer Science Department, University of California, Santa Barbara, Santa Barbara, California; Mathematics and Statistics Department, California State University, Long Beach, Long Beach, California; Computer Science Department, University of California, Santa Barbara, Santa Barbara, California

  • Venue:
  • JSSPP'07: Proceedings of the 13th International Conference on Job Scheduling Strategies for Parallel Processing
  • Year:
  • 2007

Abstract

Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. Because these machines are typically "space-shared," each job must wait in a queue until sufficient processor resources become available to service it. In production computing settings, the queuing delay (experienced by users as the time between when the job is submitted and when it begins execution) is highly variable. Users often find this variability a drag on productivity as it makes planning difficult and intellectual continuity hard to maintain. In this work, we introduce an on-line system for predicting batch-queue delay and show that it generates correct and accurate bounds for queuing delay for batch jobs from 11 machines over a 9-year period. Our system comprises 4 novel and interacting components: a predictor based on nonparametric inference; an automated change-point detector; machine-learned, model-based clustering of jobs having similar characteristics; and an automatic downtime detector to identify systemic failures that affect job queuing delay. We compare the correctness and accuracy of our system against various previously used prediction techniques and show that our new method outperforms them for all machines we have available for study.
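The abstract does not spell out how the nonparametric predictor works. As an illustration only, one standard nonparametric way to bound a quantile of queuing delay uses order statistics and the binomial distribution: the k-th smallest of n observed waits is an upper bound on the q-quantile with confidence P(Binomial(n, q) <= k - 1). The sketch below assumes independent, identically distributed wait times; the function names and default parameters are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def quantile_upper_bound(waits, quantile=0.95, confidence=0.95):
    """Upper bound on the given quantile of queue wait time.

    Illustrative sketch (assumes i.i.d. samples, not the paper's code).
    Returns the smallest order statistic whose coverage probability
    reaches `confidence`, or None if the history is too short.
    """
    n = len(waits)
    ordered = sorted(waits)
    for k in range(1, n + 1):
        # Confidence that the k-th smallest wait lies at or above
        # the true quantile.
        if binomial_cdf(k - 1, n, quantile) >= confidence:
            return ordered[k - 1]
    return None  # not enough observations for this quantile/confidence


# Hypothetical wait times in seconds, for illustration only.
history = list(range(1, 60))  # 59 observations
bound = quantile_upper_bound(history)
```

With these defaults, at least 59 observations are needed before any bound can be stated, because 1 - 0.95^n first reaches 0.95 at n = 59. A shorter history returns None rather than an unsupported bound.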