Online aggregation provides estimates of the final result of a computation while the computation is still running. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. When coupled with parallel processing, this allows for interactive exploration of the largest datasets. In this paper, we identify the main functionality requirements of sampling-based parallel online aggregation: partial aggregation, parallel sampling, and estimation. We argue for overlapped online aggregation as the only scalable solution to combine computation and estimation. We analyze the properties of existing estimators and design a novel sampling-based estimator that is robust to node delay and failure. When executed over a massive 8TB TPC-H instance, the proposed estimator achieves linear scalability and provides accurate confidence bounds early in the execution, even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size.
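To make the basic idea concrete, the following is a minimal single-node sketch of online aggregation for a SUM query. It is not the estimator proposed in the paper: it assumes the data is scanned in uniformly random order and uses a textbook normal-approximation confidence interval with a finite population correction. The function name `online_sum_estimates` and its parameters (`report_every`, `z`, `seed`) are hypothetical and chosen only for illustration.

```python
import math
import random


def online_sum_estimates(values, report_every, z=1.96, seed=0):
    """Yield (rows_seen, sum_estimate, half_width) while scanning `values`.

    Illustrative assumption: rows are visited in uniformly random order,
    so the rows seen so far form a simple random sample without replacement.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)          # simulate a random-order scan
    total = len(order)

    n = 0
    mean = 0.0
    m2 = 0.0                    # running sum of squared deviations (Welford)
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

        if n % report_every == 0 or n == total:
            sample_var = m2 / (n - 1) if n > 1 else 0.0
            fpc = (total - n) / total   # finite population correction
            half_width = z * total * math.sqrt(sample_var / n * fpc)
            # Scale the sample mean up to an estimate of the full SUM.
            yield n, total * mean, half_width


if __name__ == "__main__":
    data = range(1, 1001)
    for rows_seen, estimate, half_width in online_sum_estimates(data, 100):
        print(rows_seen, round(estimate, 1), round(half_width, 1))
```

A user watching this output could stop the scan once `half_width` is small enough for their purposes. Once every row has been seen, the correction factor drives the interval width to zero and the estimate equals the exact sum. The parallel setting described above additionally has to merge partial aggregates across nodes and cope with delayed or failed nodes, which this sketch does not attempt.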