Jockey: guaranteed job latency in data parallel clusters

Authors:
Andrew D. Ferguson;Peter Bodik;Srikanth Kandula;Eric Boutin;Rodrigo Fonseca
Affiliations:
Brown University, Providence, RI, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Bing, Redmond, WA, USA;Brown University, Providence, RI, USA
Venue:
Proceedings of the 7th ACM European Conference on Computer Systems
Year:
2012

Citing 26
Cited 12

Utilization-Based Admission Control for Real-Time Applications

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Adaptive Resource Management in Asynchronous Real-Time Distributed Systems Using Feedback Control Functions

ISADS '01 Proceedings of the Fifth International Symposium on Autonomous Decentralized Systems
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Value-maximizing deadline scheduling and its application to animation rendering

Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Dynamic Provisioning of Multi-tier Internet Applications

ICAC '05 Proceedings of the Second International Conference on Autonomic Computing
A statistical approach to risk mitigation in computational markets

Proceedings of the 16th international symposium on High performance distributed computing
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Online Optimization for Latency Assignment in Distributed Real-Time Systems

ICDCS '08 Proceedings of the 28th International Conference on Distributed Computing Systems
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Validity of the single processor approach to achieving large scale computing capabilities

AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
MapReduce optimization using regulated dynamic prioritization

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Quincy: fair scheduling for distributed computing clusters

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Proceedings of the 5th European conference on Computer systems
Prediction-based enforcement of performance contracts

GECON'07 Proceedings of the 4th international conference on Grid economics and business models
FlumeJava: easy, efficient data-parallel pipelines

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
ParaTimer: a progress indicator for MapReduce DAGs

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
The impact of virtualization on network performance of amazon EC2 data center

INFOCOM'10 Proceedings of the 29th conference on Information communications
CloudCmp: comparing public cloud providers

IMC '10 Proceedings of the 10th ACM SIGCOMM conference on Internet measurement
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Scarlett: coping with skewed content popularity in mapreduce clusters

Proceedings of the sixth conference on Computer systems
CIEL: a universal execution engine for distributed data-flow computing

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Apache hadoop goes realtime at Facebook

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics

Proceedings of the 2nd ACM Symposium on Cloud Computing

Optimizing data shuffling in data-parallel computation by understanding user-defined functions

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Automated diagnosis without predictability is a recipe for failure

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Bridging the tenant-provider gap in cloud services

Proceedings of the Third ACM Symposium on Cloud Computing
Cake: enabling high-level SLOs on shared storage systems

Proceedings of the Third ACM Symposium on Cloud Computing
alsched: algebraic scheduling of mixed workloads in heterogeneous clouds

Proceedings of the Third ACM Symposium on Cloud Computing
CloudPack: exploiting workload flexibility through rational pricing

Proceedings of the 13th International Middleware Conference
Building and scaling virtual clusters with residual resources from interactive clouds

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Omega: flexible, scalable schedulers for large compute clusters

Proceedings of the 8th ACM European Conference on Computer Systems
Speeding up distributed request-response workflows

Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM
Efficient online scheduling for deadline-sensitive jobs: extended abstract

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Natjam: design and evaluation of eviction policies for supporting priorities and deadlines in mapreduce clusters

Proceedings of the 4th annual Symposium on Cloud Computing
Agile middleware for scheduling: meeting competing performance requirements of diverse tasks

Proceedings of the 5th ACM/SPEC international conference on Performance engineering

Abstract

Data processing frameworks such as MapReduce [8] and Dryad [11] are used today in business environments where customers expect guaranteed performance. To date, however, these systems are not capable of providing guarantees on job latency because scheduling policies are based on fair-sharing, and operators seek high cluster use through statistical multiplexing and over-subscription. With Jockey, we provide latency SLOs for data parallel jobs written in SCOPE. Jockey precomputes statistics using a simulator that captures the job's complex internal dependencies, accurately and efficiently predicting the remaining run time at different resource allocations and in different stages of the job. Our control policy monitors a job's performance, and dynamically adjusts resource allocation in the shared cluster in order to maximize the job's economic utility while minimizing its impact on the rest of the cluster. In our experiments in Microsoft's production Cosmos clusters, Jockey meets the specified job latency SLOs and responds to changes in cluster conditions.
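
The control policy described in the abstract can be illustrated with a minimal sketch. This is not Jockey's implementation: the lookup table `REMAINING`, its numbers, the progress buckets, the allocation levels, and the shape of the `utility` function are all hypothetical, chosen only to show the idea of picking the smallest allocation that maximizes utility given precomputed remaining-time predictions.

```python
from bisect import bisect_right

# Hypothetical precomputed statistics (standing in for simulator output):
# REMAINING[progress][allocation] = predicted remaining run time in minutes.
REMAINING = {
    0.0: {10: 120.0, 20: 70.0, 40: 45.0},
    0.5: {10: 60.0, 20: 35.0, 40: 22.0},
    0.9: {10: 12.0, 20: 7.0, 40: 5.0},
}


def utility(finish_time, deadline):
    """Assumed utility curve: full value up to the deadline, then a linear
    drop that reaches zero at twice the deadline. Requires deadline > 0."""
    if finish_time <= deadline:
        return 1.0
    return max(0.0, 1.0 - (finish_time - deadline) / deadline)


def predicted_remaining(progress, allocation):
    """Look up the prediction for the nearest progress bucket at or below
    the observed progress (a fraction between 0 and 1)."""
    buckets = sorted(REMAINING)
    idx = max(bisect_right(buckets, progress) - 1, 0)
    return REMAINING[buckets[idx]][allocation]


def choose_allocation(elapsed, progress, deadline):
    """Return the smallest allocation whose predicted finish time yields
    the highest utility, so the job takes no more of the shared cluster
    than it needs."""
    best_utility, best_alloc = None, None
    for alloc in sorted(REMAINING[0.0]):  # ascending, so ties favor fewer resources
        finish = elapsed + predicted_remaining(progress, alloc)
        u = utility(finish, deadline)
        if best_utility is None or u > best_utility:
            best_utility, best_alloc = u, alloc
    return best_alloc


if __name__ == "__main__":
    # A job that is halfway done after 30 minutes, with a 70-minute deadline.
    print(choose_allocation(elapsed=30.0, progress=0.5, deadline=70.0))
```

Calling `choose_allocation` periodically with fresh progress readings gives a simple feedback loop: a job that falls behind is moved to a larger allocation, and a job that is ahead of schedule releases resources.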