Processing online aggregation on skewed data in MapReduce

Authors:
Yantao Gan;Xiaofeng Meng;Yingjie Shi
Affiliations:
Renmin University of China, Beijing, China;Renmin University of China, Beijing, China;Renmin University of China, Beijing, China
Venue:
Proceedings of the fifth international workshop on Cloud data management
Year:
2013

Abstract

In online aggregation, a system continuously maintains an estimate of the final answer to an aggregate query throughout execution, along with statistically meaningful bounds on the estimate's accuracy. Given the popularity of ad hoc analytic query processing over enormous datasets, providing online aggregation in a large-scale MapReduce environment is an emerging and important application need. However, existing work targets single-node, centralized environments and cannot be easily extended to fit the MapReduce paradigm. The main challenge is that, when the input is split into many blocks and data skew is prevalent, the runtimes of upstream operators are uneven, so the set of intermediate results delivered to downstream operators at any given point cannot be treated as a random sample, which leads to biased estimates. In this paper, we analyze how data skew breaks randomness in a distributed environment. To address this, we present a keep-order approach that accounts for the biases that can arise when estimating aggregates over skewed datasets in a distributed environment. Moreover, we provide a pre-computing method to ensure a fast result rate. A set of experiments indicates that our method can provide reasonably precise estimates early in the execution, with statistically valid confidence bounds, even when significant skew exists.
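
The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm; it is a toy model under stated assumptions: every block yields a single value, heavy blocks take longer to process and also carry larger values, and all blocks start at the same time. The names `mean_with_ci` and `simulate`, and all block sizes and values, are hypothetical. It compares an estimate built from the first results to finish against one built from results consumed in a randomly assigned block order.

```python
import math
import random


def mean_with_ci(values, z=1.96):
    """Sample mean and half-width of a CLT-based (approximate 95%) interval."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, float("inf")
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, z * math.sqrt(var / n)


def simulate(seed=0, n_heavy=80, n_light=320, k=100):
    """Estimate the mean block value from the first k results, two ways."""
    rng = random.Random(seed)
    # Assumed skew: heavy blocks cost more to process AND hold larger values.
    blocks = [{"cost": 5000, "value": 100.0} for _ in range(n_heavy)]
    blocks += [{"cost": 500, "value": 10.0} for _ in range(n_light)]
    rng.shuffle(blocks)  # random block order, fixed before execution

    truth = sum(b["value"] for b in blocks) / len(blocks)

    # Completion order: cheap blocks finish first, so the first k results
    # over-represent light blocks and are not a random sample.
    by_completion = sorted(blocks, key=lambda b: b["cost"])
    arrival = mean_with_ci([b["value"] for b in by_completion[:k]])

    # Assigned order: consume results in the shuffled order, waiting for
    # slow blocks rather than skipping them, so the prefix stays random.
    ordered = mean_with_ci([b["value"] for b in blocks[:k]])

    return {"truth": truth, "arrival": arrival, "ordered": ordered}


if __name__ == "__main__":
    result = simulate()
    print("true mean:        ", result["truth"])
    print("completion order: ", result["arrival"])
    print("assigned order:   ", result["ordered"])
```

In this toy setting the completion-order estimate sees only light blocks, so it is biased low and its interval is misleadingly tight, while the assigned-order prefix behaves like a random sample of blocks. Waiting for slow blocks delays early results, which is presumably the cost that the pre-computing method mentioned in the abstract is meant to offset.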