Large-Sample and Deterministic Confidence Intervals for Online Aggregation

Authors:
Peter J. Haas
Affiliations:
-
Venue:
SSDBM '97 Proceedings of the Ninth International Conference on Scientific and Statistical Database Management
Year:
1997

Citing 3
Cited 43

Selectivity and cost estimation for joins based on random sampling. Journal of Computer and System Sciences.
Online aggregation. SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data.
The Case for Online Aggregation.


Abstract

The online aggregation system recently proposed by Hellerstein et al. permits interactive exploration of large, complex datasets stored in relational database management systems. Running confidence intervals are an important component of an online aggregation system: they indicate to the user the estimated proximity of each running aggregate to the corresponding final result. Large-sample confidence intervals contain the final result with a prespecified probability and rest on central limit theorems, while deterministic confidence intervals contain the final query result with probability 1. In this paper we show how new and existing central limit theorems, simple bounding arguments, and the delta method can be used to derive formulas for both large-sample and deterministic confidence intervals. To illustrate these techniques, we obtain formulas for running confidence intervals in the case of single-table and multi-table AVG, COUNT, SUM, VARIANCE, and STDEV queries with join and selection predicates. Duplicate-elimination and GROUP-BY operations are also considered. We then provide numerically stable algorithms for computing the confidence intervals and analyze the complexity of these algorithms.
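
The distinction between the two kinds of interval can be made concrete with a minimal sketch for the simplest case, a single-table AVG with no predicates. This is not the paper's formulation. The class `RunningAvgCI` and the function `deterministic_avg_interval` are hypothetical names introduced here, the large-sample interval uses the textbook CLT form without a finite-population correction, and the deterministic interval assumes that the table size and a priori bounds `lo` and `hi` on the column values are known. Welford's update stands in as one common numerically stable way to maintain a running variance.

```python
import math
from statistics import NormalDist


class RunningAvgCI:
    """Running AVG with a CLT-based interval (illustrative sketch only)."""

    def __init__(self, confidence=0.95):
        self.n = 0        # number of tuples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean
        # Two-sided standard normal quantile for the requested confidence.
        self.z = NormalDist().inv_cdf(0.5 + confidence / 2.0)

    def add(self, x):
        # Welford's update: avoids the cancellation that the naive
        # sum-of-squares formula suffers from.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def interval(self):
        # The sample variance is undefined with fewer than two observations.
        if self.n < 2:
            return (-math.inf, math.inf)
        variance = self.m2 / (self.n - 1)
        half_width = self.z * math.sqrt(variance / self.n)
        return (self.mean - half_width, self.mean + half_width)


def deterministic_avg_interval(seen_sum, seen_count, total_count, lo, hi):
    """Interval that contains the final AVG with probability 1.

    Assumes every value lies in [lo, hi] and total_count is known:
    the unseen tuples can contribute at least lo and at most hi each.
    """
    remaining = total_count - seen_count
    lower = (seen_sum + remaining * lo) / total_count
    upper = (seen_sum + remaining * hi) / total_count
    return (lower, upper)
```

Calling `add` once per retrieved tuple and `interval` after each call yields a running interval that typically narrows as more tuples are scanned. The deterministic interval is usually much wider early on, but it collapses to the exact answer once `seen_count` reaches `total_count`.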