Data-driven batch scheduling

Authors:
John Bent; Timothy E. Denehy; Miron Livny; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau
Affiliations:
Los Alamos National Lab, Los Alamos, NM, USA; Google Inc., Mountain View, CA, USA; University of Wisconsin-Madison, Madison, WI, USA; University of Wisconsin-Madison, Madison, WI, USA; University of Wisconsin-Madison, Madison, WI, USA
Venue:
Proceedings of the second international workshop on Data-aware distributed computing
Year:
2009

Abstract

In this paper, we develop data-driven strategies for batch computing schedulers. Current CPU-centric batch schedulers ignore the data needs of workloads and execute jobs by linking them transparently and directly to the data they need. When jobs are scheduled on remote computational resources, this elegant solution of direct data access can incur an order-of-magnitude performance penalty for data-intensive workloads. Adding data awareness to batch schedulers allows careful coordination of data and CPU allocation, thereby reducing the cost of remote execution. We offer new techniques by which batch schedulers can become data-driven. Such schedulers can use our analytical predictive models to select among the four data-driven scheduling policies that we have created. Through simulation, we demonstrate the accuracy of our predictive models and show how data-driven scheduling can reduce time to completion for some workloads by as much as 80%.