Balancing reducer skew in MapReduce workloads using progressive sampling

  • Authors: Smriti R. Ramakrishnan; Garret Swart; Aleksey Urmanov
  • Affiliations: Oracle Corporation; Oracle Corporation; Oracle Corporation
  • Venue: Proceedings of the Third ACM Symposium on Cloud Computing
  • Year: 2012

Abstract

The elapsed time of a parallel job depends on the completion time of its longest-running constituent. We present a static load balancing algorithm that distributes work evenly across the reducers in a MapReduce job, resulting in significant elapsed time reductions. Taking a user-specified model of reducer performance, our load balancer uses a progressive objective-based cluster sampler to estimate the load associated with each reduce-key. It balances the workload using Key Chopping, which splits keys with large loads into sub-keys that can be assigned to different distributive reducers, and Key Packing, which assigns keys with medium loads to reducers so as to minimize the maximum reducer load. Keys with small loads are hashed, as they have little effect on the balance. Sampling and balancing repeat until the user-specified balancing objective and confidence level are achieved. The sampler and load balancer have been implemented in the Oracle Loader for Hadoop (OLH), a commercial MapReduce application that employs Apache Hadoop to perform parallel data formatting and data movement into partitioned relational tables. We present the performance improvements we achieve both in OLH and in a MapReduce program for inverted index creation. The balancer works for arbitrary IID key distributions, the time used for sampling is small, and our solution is very effective at reducing the elapsed time of the MapReduce jobs we explored.
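
The three-way treatment of keys described in the abstract (chop large keys, pack medium keys, hash small keys) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the per-key loads have already been estimated (the progressive sampler is omitted), and the function name `balance`, the thresholds `large_frac` and `small_frac`, and the use of a greedy longest-processing-time heuristic for packing are all illustrative assumptions. Chopping a key is only valid when the reducer is distributive, that is, when partial results for sub-keys can be combined afterwards.

```python
import heapq
import math
import zlib


def balance(key_loads, num_reducers, large_frac=1.0, small_frac=0.05):
    """Assign estimated per-key loads to reducers (illustrative sketch).

    Returns (assignment, loads). assignment maps each key to a list of
    (reducer, load_share) pairs; loads is the total load per reducer.
    large_frac and small_frac are hypothetical thresholds, expressed as
    fractions of the ideal (perfectly even) per-reducer load.
    """
    total = sum(key_loads.values())
    ideal = total / num_reducers
    loads = [0.0] * num_reducers
    assignment = {}
    items = []  # (load, key) pieces to be packed explicitly

    for key, load in key_loads.items():
        if load > large_frac * ideal:
            # Key chopping: split a large key into equal sub-keys,
            # each no larger than the ideal per-reducer load.
            pieces = math.ceil(load / ideal)
            items.extend((load / pieces, key) for _ in range(pieces))
        elif load >= small_frac * ideal:
            # Medium keys are packed whole.
            items.append((load, key))
        else:
            # Small keys are hashed; they barely affect the balance.
            r = zlib.crc32(str(key).encode()) % num_reducers
            loads[r] += load
            assignment[key] = [(r, load)]

    # Key packing (assumed heuristic): place the largest remaining piece
    # on the currently least-loaded reducer.
    heap = [(loads[r], r) for r in range(num_reducers)]
    heapq.heapify(heap)
    for load, key in sorted(items, key=lambda it: it[0], reverse=True):
        current, r = heapq.heappop(heap)
        assignment.setdefault(key, []).append((r, load))
        loads[r] = current + load
        heapq.heappush(heap, (loads[r], r))

    return assignment, loads


if __name__ == "__main__":
    estimated = {"a": 100, "b": 30, "c": 30, "d": 20, "e": 1, "f": 1}
    assignment, loads = balance(estimated, num_reducers=4)
    print(assignment["a"])  # the sub-keys of the chopped key "a"
    print(loads)
```

In this example the ideal load is 182 / 4 = 45.5, so key "a" (load 100) is chopped into three sub-keys, and the maximum reducer load ends up well below the 100 units that an unchopped "a" would place on a single reducer. In the scheme the abstract describes, the load estimates would come from the sampler and the whole step would be repeated with larger samples until the balancing objective is met at the requested confidence level.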