Balancing reducer workload for skewed data using sampling-based partitioning

Authors:
Yujie Xu;Wenyu Qu;Zhiyang Li;Zhaobin Liu;Changqing Ji;Yuanyuan Li;Haifeng Li
Venue:
Computers and Electrical Engineering
Year:
2014

Abstract

MapReduce has emerged as a popular tool for distributed processing of massive data. However, it handles skewed data inefficiently, which often leads to load imbalance among reducers. In this paper, we address the problem of how to partition intermediate keys efficiently so that the workload of all reducers is balanced when processing skewed data. We present a sampling scheme that computes the approximate distribution of key frequencies, estimates the overall distribution, and builds a partition scheme in advance. We then apply this scheme in the map phase of the running MapReduce job. This approach not only provides a load-balanced partition strategy but also preserves the high performance of MapReduce's synchronous mode. We also propose two partition methods based on the sampling results: cluster combination and cluster split combination. The experimental results show that our methods achieve better execution time and better load balance.
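The abstract does not spell out the algorithms, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It assumes uniform random sampling of keys, a greedy largest-first assignment for "cluster combination", and a split threshold equal to the average reducer load for "cluster split combination". All function names and parameters (`sample_key_frequencies`, `combine_clusters`, `split_and_combine`, `sample_rate`) are hypothetical.

```python
import random
from collections import Counter


def sample_key_frequencies(keys, sample_rate, seed=0):
    """Estimate per-key frequencies from a uniform random sample (assumed scheme)."""
    rng = random.Random(seed)
    sample = [k for k in keys if rng.random() < sample_rate]
    counts = Counter(sample)
    # Scale the sampled counts up to approximate the full-data frequencies.
    return {k: c / sample_rate for k, c in counts.items()}


def _least_loaded(loads):
    """Index of the reducer with the smallest current load."""
    return min(range(len(loads)), key=loads.__getitem__)


def combine_clusters(freqs, num_reducers):
    """Assumed 'cluster combination': assign whole key clusters, largest first,
    to the currently least-loaded reducer."""
    loads = [0.0] * num_reducers
    assignment = {}
    for key, freq in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        target = _least_loaded(loads)
        assignment[key] = target
        loads[target] += freq
    return assignment, loads


def split_and_combine(freqs, num_reducers):
    """Assumed 'cluster split combination': cut clusters larger than the
    average reducer load into pieces, then combine the pieces greedily."""
    avg = sum(freqs.values()) / num_reducers
    pieces = []
    for key, freq in freqs.items():
        remaining = freq
        while remaining > avg:
            pieces.append((key, avg))
            remaining -= avg
        if remaining > 0:
            pieces.append((key, remaining))
    loads = [0.0] * num_reducers
    plan = {}  # key -> list of (reducer index, piece size)
    for key, size in sorted(pieces, key=lambda p: p[1], reverse=True):
        target = _least_loaded(loads)
        plan.setdefault(key, []).append((target, size))
        loads[target] += size
    return plan, loads
```

With the made-up frequencies `{"a": 100, "b": 10, "c": 10, "d": 10, "e": 10}` and two reducers, `combine_clusters` cannot do better than loads of 100 and 40, because the dominant key stays whole. `split_and_combine` divides key `a` across both reducers and reaches loads of 70 and 70. Splitting a key only makes sense when the reduce function tolerates partial groups, which this sketch takes for granted.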