Scheduling HPC workflows for responsiveness and fairness with networking delays and inaccurate estimates of execution times

Authors:
Andrew Burkimsher; Iain Bate; Leandro Soares Indrusiak
Affiliations:
Department of Computer Science, University of York, York, UK; Department of Computer Science, University of York, York, UK; Department of Computer Science, University of York, York, UK
Venue:
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Year:
2013


Abstract

High-Performance Computing (HPC) systems have grown in popularity in recent years, especially in the form of Grid and Cloud platforms. These platforms may be subject to periods of overload. In our previous research, we found that the Projected-SLR (P-SLR) list scheduling policy provides responsiveness and a starvation-free scheduling guarantee in a realistic HPC scenario. This paper extends that work to consider networking delays in the platform model and inaccurate estimates of execution times in the application model. P-SLR is shown to be competitive with the best alternative scheduling policies in the presence of network costs (up to 400% of computation time) and where execution time estimate inaccuracies are within generous error bounds (