VARQ: virtual advance reservations for queues

Authors:
Daniel Charles Nurmi;Rich Wolski;John Brevik
Affiliations:
University of California Santa Barbara, Santa Barbara, USA;University of California Santa Barbara, Santa Barbara, USA;California State University Long Beach, Long Beach, USA
Venue:
HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Year:
2008

Abstract

In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workloads using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized batch scheduler that implements a site-specific policy designed to maximize machine utilization while providing tolerable turn-around times. To these users, the functioning of the batch scheduler and the policies it implements are both critical operating system components since they control how each job is serviced. In practice, while most HPC systems experience good utilization levels, the time individual jobs spend waiting to begin execution has been shown to be highly variable and difficult to predict, leading to user confusion, frustration, or both. One proposed method for dealing with this uncertainty is to allow users who are willing to plan ahead to make "advanced reservations" for processor resources. To date, however, few HPC centers provide an advanced reservation capability to their general user populations, since previous research indicates that machine utilization diminishes when advanced reservations are introduced. In this work, we describe VARQ, a new method for job scheduling that provides users with probabilistic "virtual" advanced reservations using only existing best-effort batch schedulers. VARQ functions as an overlay, submitting jobs that are indistinguishable from the normal workload serviced by a scheduler. We describe the statistical methods we use to implement VARQ, detail an empirical evaluation of its effectiveness in a number of HPC settings, and explore the potential future impact of VARQ should it become widely used.
We find that VARQ can implement a reservation capability probabilistically, without requiring HPC sites to support advanced reservations, and that this probabilistic approach is unlikely to negatively affect resource utilization.
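
To make the idea of a probabilistic "virtual" reservation concrete, the following is a minimal sketch and not the statistical method the paper uses, which the abstract does not detail. It assumes a list of historical queue wait times is available and uses a plain empirical quantile as a stand-in bound: if a fraction q of past jobs waited no longer than w seconds, then submitting w seconds before the desired start time gives roughly a q chance, under the assumption that future waits resemble past ones, that the job begins by that time. The function names `empirical_upper_quantile` and `latest_submit_time` are hypothetical.

```python
import math


def empirical_upper_quantile(waits, q):
    """Smallest observed wait w such that at least a fraction q of
    the observed waits are <= w (a simple stand-in for a wait-time bound)."""
    if not waits:
        raise ValueError("need at least one observed wait time")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(waits)
    rank = math.ceil(q * len(ordered))  # 1-based rank of the quantile
    return ordered[rank - 1]


def latest_submit_time(start_deadline, waits, q):
    """Latest submission time (same units as the waits) for which the
    empirical bound says the job starts by start_deadline with level q."""
    return start_deadline - empirical_upper_quantile(waits, q)


# Illustrative data only: ten made-up historical waits, in seconds.
history = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(latest_submit_time(1000, history, 0.5))
```

A job submitted this way is an ordinary batch job, which matches the overlay behaviour described above; the sketch says nothing about what to do if the job starts earlier than wanted, which is outside what the abstract specifies.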