Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction

Authors:
Daniel Nurmi;Anirban Mandal;John Brevik;Chuck Koelbel;Rich Wolski;Ken Kennedy
Affiliations:
University of California, Santa Barbara, Santa Barbara, California;Rice University, Houston, Texas;University of California, Santa Barbara, Santa Barbara, California;Rice University, Houston, Texas;University of California, Santa Barbara, Santa Barbara, California;Rice University, Houston, Texas
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006

Abstract

Large-scale distributed systems offer computational power at unprecedented levels. In the past, HPC users typically had access to relatively few individual supercomputers and, in general, would assign a one-to-one mapping of applications to machines. Modern HPC users have simultaneous access to a large number of individual machines and are beginning to make use of all of them for single-application execution cycles. One method that application developers have devised to take advantage of such systems is to organize an entire application execution cycle as a workflow. The scheduling of such workflows has been the topic of a great deal of research in the past few years. Although very sophisticated algorithms have been devised, one specific aspect of these distributed systems has so far been omitted from consideration, presumably because it is difficult to model accurately: most supercomputing resources employ batch queue scheduling software. In this work, we augment an existing workflow scheduler by introducing methods that accurately predict both the performance of the application on specific hardware and the amount of time individual workflow tasks will spend waiting in batch queues. Our results show that although a workflow scheduler alone may choose correct task placement based on data locality or network connectivity, this benefit is often compromised because most jobs submitted to current systems must wait in overcommitted batch queues for a significant portion of time. Incorporating the enhancements we describe, however, improves workflow execution time in settings where batch queues impose significant delays on constituent workflow tasks.
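
The idea in the abstract can be illustrated with a minimal sketch. This is not the scheduler evaluated in the paper. It only shows, under stated assumptions, how adding a predicted batch queue wait to a predicted run time can change which resource looks best for a task. The `Resource` fields, the additive cost model, and all numbers are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of one batch-scheduled machine."""
    name: str
    predicted_queue_wait_s: float  # assumed output of a queue wait predictor
    relative_speed: float          # assumed output of a performance model (1.0 = baseline)
    transfer_time_s: float         # assumed cost of staging input data to this machine


def estimated_completion(baseline_runtime_s, resource, include_queue_wait=True):
    """Estimate time to completion as wait + data transfer + predicted run time."""
    run_time = baseline_runtime_s / resource.relative_speed
    wait = resource.predicted_queue_wait_s if include_queue_wait else 0.0
    return wait + resource.transfer_time_s + run_time


def choose_resource(baseline_runtime_s, resources, include_queue_wait=True):
    """Pick the resource with the smallest estimated completion time."""
    return min(
        resources,
        key=lambda r: estimated_completion(baseline_runtime_s, r, include_queue_wait),
    )


# Illustrative, made-up numbers: a fast but heavily queued machine
# versus a slower machine with a short queue.
RESOURCES = [
    Resource("fast_busy", predicted_queue_wait_s=7200.0, relative_speed=2.0, transfer_time_s=60.0),
    Resource("slow_idle", predicted_queue_wait_s=300.0, relative_speed=1.0, transfer_time_s=600.0),
]

if __name__ == "__main__":
    task_runtime_s = 3600.0  # hypothetical baseline run time of one workflow task
    print(choose_resource(task_runtime_s, RESOURCES, include_queue_wait=False).name)
    print(choose_resource(task_runtime_s, RESOURCES, include_queue_wait=True).name)
```

With these made-up inputs, ignoring queue wait favors the faster machine, while including the predicted wait favors the machine with the shorter queue. A real workflow scheduler would also have to account for task dependencies and prediction error, which this sketch leaves out.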