Balancing reducer workload for skewed data using sampling-based partitioning

  • Authors:
  • Yujie Xu;Wenyu Qu;Zhiyang Li;Zhaobin Liu;Changqing Ji;Yuanyuan Li;Haifeng Li

  • Venue:
  • Computers and Electrical Engineering
  • Year:
  • 2014

Abstract

MapReduce has emerged as a popular tool for distributed processing of massive data. However, it handles skewed data inefficiently, which often leads to reducer load imbalance. In this paper, we address the problem of how to efficiently partition intermediate keys so that the workload of all reducers is balanced when processing skewed data. We present a sampling scheme that computes the approximate distribution of key frequency, estimates the overall distribution, and builds a partition scheme in advance. We then apply this scheme in the map phase of the executing MapReduce job. This work not only provides a load-balanced partition strategy, but also preserves the high performance of MapReduce's synchronous mode. We also propose two partition methods based on the sampling results: cluster combination and cluster split combination. The experimental results show that our methods achieve better execution time and better load balance.
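The general idea of sampling keys, estimating their frequencies, and fixing a partition before the job runs can be sketched as below. This is only an illustrative sketch under stated assumptions, not the authors' algorithm: the abstract does not specify how cluster combination or cluster split combination work, so the code swaps in a plain greedy heuristic (heaviest key cluster first, assigned to the currently least-loaded reducer). The function names `sample_key_counts`, `build_partition`, and `partition`, the sampling rate, and the CRC32 fallback for unsampled keys are all hypothetical.

```python
import heapq
import random
import zlib
from collections import Counter


def sample_key_counts(keys, rate, seed=0):
    """Estimate key frequencies by keeping each record with probability `rate`.

    Assumption: simple Bernoulli sampling; the paper's sampling scheme may differ.
    """
    rng = random.Random(seed)
    return Counter(k for k in keys if rng.random() < rate)


def build_partition(sample_counts, num_reducers):
    """Greedily map each sampled key cluster to the least-loaded reducer.

    Assumption: a stand-in heuristic, not the paper's cluster combination method.
    """
    # Min-heap of (estimated load, reducer id).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Process the heaviest clusters first; ties are broken by key for determinism.
    ordered = sorted(sample_counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return assignment


def partition(key, assignment, num_reducers):
    """Look up the precomputed reducer; fall back to hashing for unsampled keys."""
    if key in assignment:
        return assignment[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


if __name__ == "__main__":
    # Skewed toy input: key "a" dominates.
    data = ["a"] * 1000 + ["b"] * 600 + ["c"] * 500 + ["d"] * 100
    counts = sample_key_counts(data, rate=0.1)
    plan = build_partition(counts, num_reducers=2)
    print(plan)
```

In this sketch the partition plan is computed once from the sample and then consulted for every intermediate key, which mirrors the "partition scheme in advance" step the abstract describes. A very heavy single key still lands on one reducer here; handling that case would require splitting a cluster across reducers, which this sketch does not attempt.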