Predicting bounds on queuing delay for batch-scheduled parallel machines

Authors:
John Brevik;Daniel Nurmi;Rich Wolski
Affiliations:
University of California, Santa Barbara, CA;University of California, Santa Barbara, CA;University of California, Santa Barbara, CA
Venue:
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2006

Abstract

Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have accounts at multiple sites and have the option of choosing at which site or sites to submit a parallel job. In such a situation, the amount of time a user's job will wait in any one batch queue can significantly impact the overall time a user waits from job submission to job completion. In this work, we explore a new method for providing end-users with predictions for the bounds on the queuing delay individual jobs will experience. We evaluate this method using batch scheduler logs for distributed-memory parallel machines that cover a 9-year period at 7 large HPC centers. Our results show that it is possible to predict delay bounds reliably for jobs in different queues, and for jobs requesting different ranges of processor counts. Using this information, scientific application developers can intelligently decide where to submit their parallel codes in order to minimize overall turnaround time.
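
The abstract does not spell out how the bounds are computed, so the following is only an illustrative sketch of one generic, non-parametric way to bound a quantile of queue wait time from historical waits. It treats past waits as independent draws from a single distribution (an assumption that real batch queues may violate) and uses the binomial distribution to pick an order statistic that upper-bounds the chosen quantile with a stated confidence. The function names `binom_cdf` and `upper_bound_on_quantile`, and the default levels of 0.95, are hypothetical choices made for this example.

```python
from math import exp, lgamma, log


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each term is computed in log space so large n does not overflow.
    """
    total = 0.0
    for i in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_pmf)
    return total


def upper_bound_on_quantile(waits, quantile=0.95, confidence=0.95):
    """Return an observed wait that upper-bounds the given quantile.

    The k-th smallest observation is at least the true quantile with
    probability >= P(Binomial(n, quantile) <= k - 1), assuming the
    observations are independent and identically distributed.
    Returns None when the history is too short to reach the confidence.
    """
    xs = sorted(waits)
    n = len(xs)
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, quantile) >= confidence:
            return xs[k - 1]
    return None


if __name__ == "__main__":
    # Hypothetical history of queue waits in seconds, for illustration only.
    history = [30, 45, 60, 120, 300, 15, 600, 90, 75, 20]
    print(upper_bound_on_quantile(history, quantile=0.5, confidence=0.95))
```

Under these assumptions, a site with a lower bound for the same quantile and confidence would be the more attractive place to submit, which is the kind of comparison the abstract has in mind. A short history cannot support a high quantile at high confidence, which is why the function returns `None` rather than guessing.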