Why let resources idle? Aggressive cloning of jobs with Dolly

Authors:
Ganesh Ananthanarayanan;Ali Ghodsi;Scott Shenker;Ion Stoica
Affiliations:
University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley
Venue:
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Year:
2012

Abstract

Despite prior research on outlier mitigation, our analysis of jobs from the Facebook cluster shows that outliers still occur, especially in small jobs. Small jobs are particularly sensitive to long-running outlier tasks because of their interactive nature. Outlier mitigation strategies rely on comparing different tasks of the same job and launching speculative copies of the slower tasks. However, small jobs execute all their tasks simultaneously, which leaves too little time to observe and compare tasks. Building on the observation that clusters are underutilized, we take speculation to its logical extreme: run full clones of jobs to mitigate the effect of outliers. The heavy-tailed distribution of job sizes implies that we can impact most jobs without using many resources. Trace-driven simulations show that the average completion time of all the small jobs improves by 47% using cloning, at the cost of just 3% extra resources.
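
The two ideas in the abstract, that a cloned job finishes when its fastest clone does and that a heavy-tailed size distribution keeps the cost of cloning small jobs low, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the rule that the first clone to finish supplies the result, the use of task count as the unit of resources, and the hand-made numbers, which are not taken from the Facebook trace or from the paper's simulations.

```python
def cloned_job_time(task_times_per_clone):
    """Completion time of a job that runs as several full clones.

    task_times_per_clone[c] lists the task durations of clone c.
    Assumption: a clone finishes when its slowest task finishes, and the
    job finishes as soon as any one clone has finished.
    """
    clone_finish_times = [max(tasks) for tasks in task_times_per_clone]
    return min(clone_finish_times)


def cloning_overhead(job_sizes, small_job_limit, extra_clones=1):
    """Share of jobs that get cloned and the extra resources this costs.

    job_sizes holds the number of tasks per job; a job is treated as
    "small" (hypothetical rule) when it has at most small_job_limit tasks.
    Resources are approximated by task counts.
    """
    small = [size for size in job_sizes if size <= small_job_limit]
    fraction_cloned = len(small) / len(job_sizes)
    extra_resources = extra_clones * sum(small) / sum(job_sizes)
    return fraction_cloned, extra_resources


# One clone hits an outlier task (95 s); the other clone does not.
clones = [[10, 12, 95], [11, 40, 13]]
single = cloned_job_time(clones[:1])   # only the first clone runs
cloned = cloned_job_time(clones)       # both clones run

# Illustrative heavy-tailed job sizes: many small jobs, a few very large ones.
sizes = [1, 2, 2, 3, 5, 8, 10, 400, 1500, 3000]
fraction, extra = cloning_overhead(sizes, small_job_limit=10)
```

With these made-up inputs the uncloned job takes 95 s and the cloned job 40 s, while cloning every job of at most 10 tasks covers 70% of the jobs for under 1% extra tasks. The figures only show the shape of the argument; they say nothing about the 47% and 3% results reported above.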