MapReduce has emerged as a popular tool for distributed processing of massive data. However, it handles skewed data inefficiently, and skew often leads to load imbalance among reducers. In this paper, we address the problem of how to partition intermediate keys efficiently so that the workload of all reducers is balanced when processing skewed data. We present a sampling scheme that computes the approximate distribution of key frequencies, estimates the overall distribution, and builds a partition scheme in advance. We then apply this scheme to the map phase of the executing MapReduce job. This approach not only provides a load-balanced partition strategy but also preserves the high performance of MapReduce's synchronous mode. We also propose two partition methods based on the sampling results: cluster combination and cluster split combination. Experimental results show that our methods achieve better execution time and better load balance.
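The general idea of sampling key frequencies and fixing a partition before the job runs can be sketched as below. This is a minimal illustration, not the paper's algorithm: the abstract does not specify how cluster combination or cluster split combination work, so the sketch substitutes a plain greedy heuristic (heaviest key first, assigned to the least-loaded reducer) and does not split any key across reducers. All function names, the Bernoulli sampling rate, and the hash fallback for unsampled keys are assumptions made for the example.

```python
import heapq
import random
import zlib
from collections import Counter


def sample_key_counts(keys, rate, seed=0):
    """Bernoulli-sample the key stream and count the sampled keys (hypothetical helper)."""
    rng = random.Random(seed)
    return Counter(k for k in keys if rng.random() < rate)


def estimate_counts(sampled, rate):
    """Scale sampled counts by 1/rate to estimate the full key frequencies."""
    return {k: c / rate for k, c in sampled.items()}


def build_partition(estimated, num_reducers):
    """Greedy stand-in for the paper's partition methods (an assumption):
    take keys from heaviest to lightest and give each to the reducer
    with the smallest estimated load so far."""
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Ties are broken by the key's string form so the result is deterministic.
    for key, load in sorted(estimated.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        total, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment


def partition(key, assignment, num_reducers):
    """Look up the precomputed reducer; keys missed by the sample fall back to hashing."""
    if key in assignment:
        return assignment[key]
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def reducer_loads(estimated, assignment, num_reducers):
    """Sum the estimated load that the assignment places on each reducer."""
    loads = [0.0] * num_reducers
    for key, load in estimated.items():
        loads[assignment[key]] += load
    return loads
```

In this sketch the partition table is built once from the sample and then consulted by every map task, which matches the abstract's description of making the partition scheme in advance. A heavy key that exceeds a fair share of the load would still land on a single reducer here; handling that case is presumably what a split-based method addresses, but the abstract gives no details.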