Backfilling Using System-Generated Predictions Rather than User Runtime Estimates

Authors:
Dan Tsafrir;Yoav Etsion;Dror G. Feitelson
Affiliations:
IEEE;IEEE;IEEE
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2007

Abstract

The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). To make such determinations possible, users are required to provide estimates of how long jobs will run, and jobs that violate these estimates are killed. Empirical studies have repeatedly shown that user estimates are inaccurate, and that system-generated predictions based on history may be significantly better. However, predictions have not been incorporated into production schedulers, partially due to a misconception (that we resolve) claiming inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: Users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction and correcting predictions adaptively as needed if they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g., using FCFS as the basis) and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm and are not limited to EASY.
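
The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names, the fallback to the user estimate when a user has no history, the cap of the prediction at the user estimate, and the fixed 60-second correction step are all assumptions made for illustration. What the abstract does state is that the predictor averages the runtimes of the last two jobs by the same user, that the kill-time stays tied to the user estimate rather than to the prediction, and that a prediction proved too short is corrected instead of the job being killed. The backfill test follows the usual EASY rule of not delaying the first queued job.

```python
from collections import defaultdict, deque


class RecentAveragePredictor:
    """Predicts a runtime as the mean of the same user's most recent runtimes."""

    def __init__(self, window=2):
        # window=2 matches the "last two jobs by the same user" in the abstract.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def predict(self, user, estimate):
        past = self.history[user]
        if not past:
            # Assumption: with no history, fall back to the user estimate.
            return estimate
        # Assumption: the prediction never exceeds the user estimate,
        # because the estimate remains the kill-time.
        return min(sum(past) / len(past), estimate)

    def record(self, user, runtime):
        """Store the actual runtime of a completed job."""
        self.history[user].append(runtime)


def correct_prediction(elapsed, estimate, step=60.0):
    """Raise a prediction that the running job has already outlived.

    The job is not killed; only the scheduler's prediction grows.
    The fixed `step` is a hypothetical choice, capped by the kill-time.
    """
    return min(elapsed + step, estimate)


def can_backfill(now, procs, predicted, free_procs, shadow_time, extra_procs):
    """EASY-style test using the prediction in place of the user estimate.

    shadow_time: when the first queued job is expected to start.
    extra_procs: processors that job will leave unused at that time.
    """
    if procs > free_procs:
        return False
    return now + predicted <= shadow_time or procs <= extra_procs


predictor = RecentAveragePredictor()
predictor.record("alice", 100.0)
predictor.record("alice", 300.0)
prediction = predictor.predict("alice", estimate=3600.0)  # mean of 100 and 300
print(prediction, can_backfill(0.0, 4, prediction, 8, 300.0, 2))
```

In this sketch the user estimate is used only as an upper bound and as the kill-time, so from the user's point of view the job is treated exactly as it would be under plain EASY, while the scheduler plans with the shorter prediction.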