In online aggregation, a system continuously maintains an estimate of the final answer to an aggregate query throughout execution, along with statistically meaningful bounds on the estimate's accuracy. Given the popularity of ad-hoc analytic query processing over enormous datasets, providing online aggregation in a large-scale MapReduce environment is an emerging and important application need. However, existing work targets single-node, centralized environments and cannot be easily extended to fit the MapReduce paradigm. The main challenge is that, with many input blocks and prevalent data skew, the runtime of upstream operators is uneven, so the set of intermediate results delivered to downstream operators at any particular point cannot be treated as a random sample, which leads to biased estimates. In this paper, we analyze how data skew breaks randomness in a distributed environment. To address this, we present a keep-order approach that accounts for the biases that can arise when estimating aggregates over skewed datasets in a distributed environment. Moreover, we provide a pre-computing method to ensure a fast result rate. A set of experiments indicates that our method provides reasonably precise estimates early in the execution, with statistically valid confidence bounds, even when significant skew exists.
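The bias described above can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It assumes a hypothetical workload in which each block carries a processing cost and a per-block value that grows with that cost, and it uses a standard large-sample (CLT) interval for the running mean. All names (`running_estimate`, `estimate_after`, `blocks`) are illustrative. It compares consuming block results in completion order (cheap blocks first) with consuming them in the original random scheduling order, which is one plausible reading of "keep-order".

```python
import math
import random


def running_estimate(values, z=1.96):
    """Yield (n, mean, half_width) after each value, using Welford's update.

    half_width is the large-sample (CLT) confidence half-width; it is
    infinite until at least two values have been seen.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            half = z * math.sqrt(m2 / (n - 1) / n)
        else:
            half = math.inf
        yield n, mean, half


def estimate_after(seq, k):
    """Return (mean, half_width) once k block results have been consumed."""
    for n, mean, half in running_estimate(value for _, value in seq):
        if n == k:
            return mean, half
    raise ValueError("k exceeds the number of blocks")


# Hypothetical skewed input: (cost, value) pairs where value grows with cost.
rng = random.Random(0)
costs = rng.choices([1, 2, 8], weights=[6, 3, 1], k=200)
blocks = [(cost, 10.0 * cost + rng.gauss(0, 1)) for cost in costs]
true_mean = sum(value for _, value in blocks) / len(blocks)

# Completion order: cheap blocks finish first, so early results are not random.
by_completion = sorted(blocks, key=lambda b: b[0])
biased_mean, biased_half = estimate_after(by_completion, 50)

# Original (random) scheduling order: any prefix is a random sample of blocks.
ordered_mean, ordered_half = estimate_after(blocks, 50)

print(f"true mean        : {true_mean:.2f}")
print(f"completion order : {biased_mean:.2f} +/- {biased_half:.2f}")
print(f"scheduling order : {ordered_mean:.2f} +/- {ordered_half:.2f}")
```

In this toy setting the completion-order estimate is computed only from the cheapest blocks, so its interval is narrow but centered far from the true mean. The scheduling-order estimate is wider but is computed from a genuinely random prefix.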