Sampling estimators for parallel online aggregation

Authors:
Chengjie Qin;Florin Rusu
Affiliations:
University of California, Merced;University of California, Merced
Venue:
BNCOD'13 Proceedings of the 29th British National conference on Big Data
Year:
2013

Abstract

Online aggregation provides estimates of the final result of a computation while the computation is still running. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. When coupled with parallel processing, this enables interactive exploration of the largest datasets. In this paper, we identify the main functionality requirements of sampling-based parallel online aggregation: partial aggregation, parallel sampling, and estimation. We argue for overlapped online aggregation as the only scalable solution to combine computation and estimation. We analyze the properties of existing estimators and design a novel sampling-based estimator that is robust to node delay and failure. When executed over a massive 8TB TPC-H instance, the proposed estimator achieves linear scalability and provides accurate confidence bounds early in the execution, even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size.
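
To make the idea of an estimate with confidence bounds concrete, the following is a minimal single-node sketch in Python. It is not the estimator proposed in the paper and does not model parallelism, node delay, or failure. It assumes the data can be scanned in random order and applies a textbook normal-approximation bound with a finite population correction to a SUM aggregate with a selection predicate. The function name online_sum_estimates and its parameters are hypothetical, chosen only for illustration.

```python
import math
import random


def online_sum_estimates(values, predicate, chunk_size, z=1.96, seed=0):
    """Yield (rows_seen, estimate, half_width) after each chunk of a random scan.

    Illustrative only: a textbook estimator for SUM over rows matching
    `predicate`, not the estimator designed in the paper.
    """
    n_total = len(values)
    order = list(range(n_total))
    # Assumption: a random scan order, so every prefix is a uniform sample
    # drawn without replacement.
    random.Random(seed).shuffle(order)

    seen = 0
    s1 = 0.0  # running sum of per-row contributions
    s2 = 0.0  # running sum of squared per-row contributions
    for idx in order:
        # Rows that fail the predicate contribute 0 to the aggregate.
        x = values[idx] if predicate(values[idx]) else 0.0
        seen += 1
        s1 += x
        s2 += x * x
        if seen % chunk_size == 0 or seen == n_total:
            mean = s1 / seen
            # Sample variance; clamp tiny negative values from rounding.
            var = max((s2 - seen * mean * mean) / (seen - 1), 0.0) if seen > 1 else 0.0
            # Finite population correction: the bound collapses to zero
            # once the whole dataset has been scanned.
            fpc = (n_total - seen) / n_total
            estimate = n_total * mean
            half_width = z * n_total * math.sqrt(var / seen * fpc)
            yield seen, estimate, half_width
```

A caller would iterate over the generator and stop as soon as half_width drops below an acceptable tolerance, which mirrors the early-stopping behavior described in the abstract. If the scan runs to completion, the estimate equals the exact aggregate and the interval has zero width.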