QBETS: queue bounds estimation from time series

Authors:
Daniel Nurmi;John Brevik;Rich Wolski
Affiliations:
Computer Science Department, University of California, Santa Barbara, Santa Barbara, California;Mathematics and Statistics Department, California State University, Long Beach, Long Beach, California;Computer Science Department, University of California, Santa Barbara, Santa Barbara, California
Venue:
JSSPP'07 Proceedings of the 13th international conference on Job scheduling strategies for parallel processing
Year:
2007

Abstract

Most parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. Because these machines are typically "space-shared," each job must wait in a queue until sufficient processor resources become available to service it. In production computing settings, the queuing delay (experienced by users as the time between when a job is submitted and when it begins execution) is highly variable. Users often find this variability a drag on productivity because it makes planning difficult and intellectual continuity hard to maintain. In this work, we introduce an on-line system for predicting batch-queue delay and show that it generates correct and accurate bounds on queuing delay for batch jobs from 11 machines over a 9-year period. Our system comprises four novel, interacting components: a predictor based on nonparametric inference; an automated change-point detector; machine-learned, model-based clustering of jobs having similar characteristics; and an automatic downtime detector to identify systemic failures that affect job queuing delay. We compare the correctness and accuracy of our system against various previously used prediction techniques and show that our new method outperforms them on all machines available to us for study.
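
To make the idea of a nonparametric bound on queuing delay concrete, the following is a minimal sketch of one standard technique: an upper confidence bound on a quantile, obtained from order statistics and the binomial distribution. The abstract states only that the predictor is "based on nonparametric inference," so this particular method, the function names `binomial_cdf` and `quantile_upper_bound`, and the default quantile and confidence levels are all illustrative assumptions, not a description of the system itself. The sketch also assumes the historical wait times behave like independent draws from one distribution, which is the kind of assumption a change-point detector would be there to protect.

```python
from math import comb  # requires Python 3.8+


def binomial_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def quantile_upper_bound(waits, quantile=0.95, confidence=0.95):
    """Upper bound on the given quantile of the wait-time distribution.

    Hypothetical helper: returns the smallest order statistic x_(k) such that
    P(Binomial(n, quantile) <= k - 1) >= confidence, which guarantees that
    x_(k) is at or above the true quantile with at least that confidence.
    Returns None when the history is too short to support the request.
    """
    ordered = sorted(waits)
    n = len(ordered)
    for k in range(1, n + 1):
        if binomial_cdf(k - 1, n, quantile) >= confidence:
            return ordered[k - 1]  # k is 1-indexed, the list is 0-indexed
    return None
```

With the default levels, this sketch needs at least 59 observations before it can return a bound at all, because the largest of n samples exceeds the 0.95 quantile with probability 1 - 0.95^n, and that first reaches 0.95 at n = 59. For example, `quantile_upper_bound(list(range(1, 60)))` returns the largest sample, 59, while a history of only 58 values returns `None`.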