Scarlett: coping with skewed content popularity in MapReduce clusters

  • Authors:
  • Ganesh Ananthanarayanan;Sameer Agarwal;Srikanth Kandula;Albert Greenberg;Ion Stoica;Duke Harlan;Ed Harris

  • Affiliations:
  • University of California, Berkeley, Berkeley, CA, USA;University of California, Berkeley, Berkeley, CA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;University of California, Berkeley, Berkeley, CA, USA;Microsoft Bing, Redmond, WA, USA;Microsoft Bing, Redmond, WA, USA

  • Venue:
  • Proceedings of the sixth conference on Computer systems
  • Year:
  • 2011

Abstract

To improve data availability and resilience, MapReduce frameworks use file systems that replicate data uniformly. However, analysis of job logs from a large production cluster shows wide disparity in data popularity. Machines and racks storing popular content become bottlenecks, thereby increasing the completion times of jobs accessing this data even when there are machines with spare cycles in the cluster. To address this problem, we present Scarlett, a system that replicates blocks based on their popularity. By accurately predicting file popularity and working within hard bounds on additional storage, Scarlett causes minimal interference to running jobs. Trace-driven simulations and experiments in two popular MapReduce frameworks (Hadoop, Dryad) show that Scarlett effectively alleviates hotspots and can speed up jobs by 20.2%.
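
The idea of replicating by popularity within a hard storage bound can be illustrated with a small sketch. This is not Scarlett's actual algorithm, which the abstract does not spell out. It is a minimal greedy illustration under stated assumptions: popularity is taken to be a per-file count of concurrent accesses, the budget is measured in blocks, and the names `FileStats` and `plan_replication` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    name: str
    size_blocks: int          # number of blocks in the file
    concurrent_accesses: int  # assumed popularity signal derived from job logs


def plan_replication(files, base_replicas=3, budget_blocks=0):
    """Greedy sketch: hand out extra replicas to the most popular files
    first, never exceeding the extra-storage budget (in blocks)."""
    plan = {f.name: base_replicas for f in files}
    remaining = budget_blocks
    # Visit files from most to least popular.
    for f in sorted(files, key=lambda f: f.concurrent_accesses, reverse=True):
        # Assumption: aim for one replica per concurrent access,
        # but never drop below the uniform baseline.
        desired = max(base_replicas, f.concurrent_accesses)
        extra = desired - base_replicas
        if f.size_blocks <= 0:
            continue
        # Each extra replica of this file costs size_blocks of budget.
        affordable = min(extra, remaining // f.size_blocks)
        plan[f.name] += affordable
        remaining -= affordable * f.size_blocks
    return plan


files = [
    FileStats("hot", size_blocks=10, concurrent_accesses=6),
    FileStats("warm", size_blocks=5, concurrent_accesses=4),
    FileStats("cold", size_blocks=20, concurrent_accesses=1),
]
print(plan_replication(files, base_replicas=3, budget_blocks=35))
```

With a budget of 35 blocks, the sketch raises the hot file from 3 to 6 replicas and the warm file from 3 to 4, while the cold file stays at the uniform baseline. A real system would also have to decide where to place the extra replicas and how to create them without disturbing running jobs, which this sketch does not model.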