The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). To make such determinations possible, users are required to provide estimates of how long jobs will run, and jobs that violate these estimates are killed. Empirical studies have repeatedly shown that user estimates are inaccurate, and that system-generated predictions based on history may be significantly better. However, predictions have not been incorporated into production schedulers, partially due to a misconception (that we resolve) claiming inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: Users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction and correcting predictions adaptively as needed if they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g., using FCFS as the basis) and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm and are not limited to EASY.
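The two mechanisms described above, a predictor that averages the runtimes of the same user's last two jobs and a kill-time kept separate from that prediction, can be illustrated with a minimal Python sketch. The names `RecentRuntimePredictor` and `correct_prediction` are hypothetical. Three behaviors are assumptions, since the abstract does not specify them: falling back to the user estimate when no history exists, capping the prediction at the user estimate, and extending an underprediction by a fixed 60-second increment.

```python
from collections import defaultdict, deque


class RecentRuntimePredictor:
    """Predicts a job's runtime as the mean of the same user's most recent runtimes."""

    def __init__(self, history_len=2):
        # Keep only the last `history_len` completed runtimes per user;
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._history = defaultdict(lambda: deque(maxlen=history_len))

    def predict(self, user, user_estimate):
        """Return the runtime the scheduler would plan with (seconds)."""
        runs = self._history[user]
        if not runs:
            # Assumption: with no history, plan with the user's own estimate.
            return user_estimate
        average = sum(runs) / len(runs)
        # Assumption: cap at the user estimate, because the job is killed
        # at that point and cannot run longer.
        return min(average, user_estimate)

    def record(self, user, runtime):
        """Store the actual runtime of a finished job (seconds)."""
        self._history[user].append(runtime)


def correct_prediction(elapsed, user_estimate, increment=60.0):
    """Extend a prediction that a still-running job has outlived.

    The job is NOT killed here: the kill-time remains the user estimate.
    Assumption: the fixed `increment` is illustrative only; the abstract
    says corrections are adaptive but does not give the schedule.
    """
    return min(elapsed + increment, user_estimate)
```

In this sketch a scheduler would call `predict` when a job is queued, use the result for backfilling decisions, call `correct_prediction` whenever a running job exceeds its current prediction, and still kill the job only when it reaches the user estimate, so users see the same termination behavior as under EASY.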