Processing online aggregation on skewed data in MapReduce

  • Authors:
  • Yantao Gan;Xiaofeng Meng;Yingjie Shi

  • Affiliations:
  • Renmin University of China, Beijing, China;Renmin University of China, Beijing, China;Renmin University of China, Beijing, China

  • Venue:
  • Proceedings of the fifth international workshop on Cloud data management
  • Year:
  • 2013

Abstract

In online aggregation, a system continuously maintains an estimate of the final answer to an aggregate query throughout execution, along with statistically meaningful bounds on the estimate's accuracy. Given the popularity of ad hoc analytic query processing over enormous datasets, providing online aggregation in a large-scale MapReduce environment is an emerging and important application need. However, existing work targets single-node, centralized environments and cannot be easily extended to fit the MapReduce paradigm. The main challenge is that, with many input blocks and prevalent data skew, the runtimes of upstream operators are uneven, so the set of intermediate results delivered to downstream operators at any particular point cannot be treated as a random sample, which leads to biased estimates. In this paper, we analyze how data skew breaks randomness in a distributed environment. To address this, we present a keep-order approach that accounts for the biases that can arise when estimating aggregates over skewed datasets in a distributed environment. Moreover, we provide a pre-computing method to ensure a fast result rate. A set of experiments indicates that our method can provide reasonably precise estimates early in the execution with statistically valid confidence bounds, even when significant skew exists.
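
To make the two central ideas of the abstract concrete, the following is a minimal sketch and not the authors' implementation. It shows (1) a running estimate of a mean with a normal-approximation confidence bound, and (2) how consuming block results in completion order can bias an early estimate when blocks are skewed. The function names, the toy blocks, and the assumption that a block's runtime grows with its size are all hypothetical and chosen only for illustration.

```python
import math
from typing import Iterable, Iterator, List, Sequence, Tuple


def running_mean_estimates(
    values: Iterable[float], z: float = 1.96
) -> Iterator[Tuple[int, float, float]]:
    """Yield (n, mean, half_width) after each value, using Welford's update.

    half_width is the normal-approximation bound z * s / sqrt(n). It is only
    statistically meaningful if the values seen so far form a random sample.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            std = math.sqrt(m2 / (n - 1))
            half_width = z * std / math.sqrt(n)
        else:
            half_width = math.inf  # no variance estimate from one value
        yield n, mean, half_width


def prefix_mean(
    blocks: Sequence[Sequence[float]], order: Sequence[int], k: int
) -> float:
    """Mean over the values of the first k blocks delivered in `order`."""
    seen: List[float] = [v for i in order[:k] for v in blocks[i]]
    return sum(seen) / len(seen)


# Hypothetical skewed input: two small blocks and one large block.
blocks = [[1.0] * 2, [1.0] * 2, [10.0] * 8]

# Assumption: runtime grows with block size, so small blocks finish first.
completion_order = sorted(range(len(blocks)), key=lambda i: len(blocks[i]))

true_mean = prefix_mean(blocks, completion_order, len(blocks))
early_mean = prefix_mean(blocks, completion_order, 2)

if __name__ == "__main__":
    for n, mean, hw in running_mean_estimates([1.0, 2.0, 3.0, 4.0]):
        print(f"n={n} mean={mean:.3f} +/- {hw:.3f}")
    print(f"early estimate={early_mean}, true mean={true_mean}")
```

In this toy setting the early estimate only sees the fast, small blocks, so it sits far from the true mean even though the running confidence bound would look tight. That is the kind of bias the abstract attributes to uneven upstream runtimes.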