Robust estimation with sampling and approximate pre-aggregation

  • Authors:
  • Christopher Jermaine

  • Affiliations:
  • Department of Computer and Information Sciences and Engineering, University of Florida, Gainesville, FL

  • Venue:
  • VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
  • Year:
  • 2003

Abstract

The majority of data reduction techniques for approximate query processing (such as wavelets, histograms, kernels, and so on) are not usually applicable to categorical data. There has been something of a disconnect between research in this area and the reality of database data; much recent research has focused on approximate query processing over ordered or numerical attributes, but arguably the majority of database attributes are categorical: country, state, job_title, color, sex, department, and so on. This paper considers the problem of approximating aggregate functions over categorical data, or mixed categorical/numerical data. We propose a method based on random sampling, called Approximate Pre-Aggregation (APA). The biggest drawback of sampling for aggregate function estimation is its sensitivity to attribute value skew, and APA uses several techniques to overcome this sensitivity. The increase in accuracy using APA compared to "plain vanilla" sampling is dramatic. For SUM and AVG queries, the relative error for random sampling alone is more than 700% greater than for sampling with APA. Even if stratified sampling techniques are used, the error is still between 28% and 175% greater than for APA.
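The abstract compares APA against "plain vanilla" random sampling as the baseline estimator for SUM and AVG. The minimal sketch below illustrates that baseline and why it is sensitive to attribute value skew; it is not the paper's APA method, and all names and the example data are hypothetical.

```python
import random

def estimate_sum_and_avg(population, sample_size, seed=0):
    """Estimate SUM and AVG of `population` from a simple random sample."""
    random.seed(seed)
    sample = random.sample(population, sample_size)
    n, m = len(population), len(sample)
    est_avg = sum(sample) / m
    est_sum = est_avg * n  # scale the sample mean up to the full population
    return est_sum, est_avg

# A heavily skewed attribute: a handful of very large values dominate the
# true SUM, so a small sample that misses them is badly off -- the skew
# sensitivity that motivates APA.
values = [1] * 9990 + [100000] * 10
true_sum = sum(values)
est_sum, est_avg = estimate_sum_and_avg(values, sample_size=100)
print(f"true SUM = {true_sum}, estimated SUM = {est_sum:.0f}, "
      f"relative error = {abs(est_sum - true_sum) / true_sum:.1%}")
```

Stratified sampling, which the abstract also mentions as a stronger baseline, would partition `values` into strata (for example by value range or by a categorical attribute) and sample each stratum separately before recombining the per-stratum estimates.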