Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Authors:
Benjamin Gufler;Nikolaus Augsten;Angelika Reiser;Alfons Kemper
Affiliations:
-;-;-;-
Venue:
ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Year:
2012

Citing 0
Cited 7

SkewTune in action: mitigating skew in MapReduce applications

Proceedings of the VLDB Endowment
Designing good algorithms for MapReduce and beyond

Proceedings of the Third ACM Symposium on Cloud Computing
Minimal MapReduce algorithms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
The case for tiny tasks in compute clusters

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
Bisimulation reduction of big graphs on mapreduce

BNCOD'13 Proceedings of the 29th British National conference on Big Data
Balancing reducer workload for skewed data using sampling-based partitioning

Computers and Electrical Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is being used increasingly in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution while scientific data sets are often highly skewed. The resulting load imbalance, which raises the processing time, is even amplified by high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. An accurate cost estimation is the basis for adaptive load balancing algorithms and requires to gather statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets approximating the global data distribution.