Scarlett: coping with skewed content popularity in MapReduce clusters

Authors:
Ganesh Ananthanarayanan;Sameer Agarwal;Srikanth Kandula;Albert Greenberg;Ion Stoica;Duke Harlan;Ed Harris
Affiliations:
University of California, Berkeley, Berkeley, CA, USA;University of California, Berkeley, Berkeley, CA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;University of California, Berkeley, Berkeley, CA, USA;Microsoft Bing, Redmond, WA, USA;Microsoft Bing, Redmond, WA, USA
Venue:
Proceedings of the sixth conference on Computer systems
Year:
2011

Abstract

To improve data availability and resilience, MapReduce frameworks use file systems that replicate data uniformly. However, analysis of job logs from a large production cluster shows wide disparity in data popularity. Machines and racks storing popular content become bottlenecks, thereby increasing the completion times of jobs accessing this data even when there are machines with spare cycles in the cluster. To address this problem, we present Scarlett, a system that replicates blocks based on their popularity. By accurately predicting file popularity and working within hard bounds on additional storage, Scarlett causes minimal interference to running jobs. Trace-driven simulations and experiments in two popular MapReduce frameworks (Hadoop, Dryad) show that Scarlett effectively alleviates hotspots and can speed up jobs by 20.2%.
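
The idea of replicating by popularity within a hard storage bound can be illustrated with a minimal sketch. This is not Scarlett's actual algorithm, which the abstract does not specify. It is a hypothetical greedy policy, and the names `FileStats`, `allocate_replicas`, `accesses`, and `budget_blocks`, as well as the baseline replication factor of 3, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    name: str
    size_blocks: int   # file size, in blocks
    accesses: int      # observed concurrent accesses (assumed popularity measure)
    replicas: int = 3  # assumed uniform baseline replication factor


def allocate_replicas(files, budget_blocks):
    """Greedily add replicas to the file with the most accesses per replica.

    Stops when no file that still has more accesses than replicas fits
    in the remaining extra-storage budget. Returns (plan, remaining_budget).
    """
    plan = {f.name: f.replicas for f in files}
    remaining = budget_blocks
    while True:
        # A file is a candidate only if it is still "hotter" than its
        # replica count and one more copy fits within the budget.
        candidates = [
            f for f in files
            if f.accesses > plan[f.name] and f.size_blocks <= remaining
        ]
        if not candidates:
            break
        hottest = max(candidates, key=lambda f: f.accesses / plan[f.name])
        plan[hottest.name] += 1
        remaining -= hottest.size_blocks
    return plan, remaining


files = [
    FileStats("hot", size_blocks=10, accesses=12),
    FileStats("warm", size_blocks=10, accesses=4),
    FileStats("cold", size_blocks=100, accesses=1),
]
plan, remaining = allocate_replicas(files, budget_blocks=30)
print(plan, remaining)
```

In this toy input the whole budget goes to the most heavily accessed file, while the cold file keeps its baseline replica count. A real system would also have to decide where to place the extra replicas and how to predict popularity, which this sketch leaves out.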