Histograms as statistical estimators for aggregate queries

Authors:
Lixia Chen;Alin Dobra
Affiliations:
Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, United States;Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, United States
Venue:
Information Systems
Year:
2013


Abstract

The traditional statistical assumption for interpreting histograms, and for justifying the approximate query processing methods based on them, is that all elements in a bucket have the same frequency; this is called the uniform distribution assumption. In this paper, we analyze histograms from a statistical point of view. We show that a significantly less restrictive statistical assumption, namely that the elements within a bucket are randomly arranged even though they might have different frequencies, leads to identical formulas for approximating aggregate queries using histograms. Under this assumption, we analyze the behavior of both unidimensional and multidimensional histograms and provide tight error guarantees for the quality of approximations. We conclude that histograms are the best estimators if the assumption holds; sampling and sketching are significantly worse. As an example of how the statistical theory of histograms can be extended, we show how XSketches, an approximation technique for XML queries that uses histograms as building blocks, can be statistically analyzed. The combination of the random shuffling assumption and the other statistical assumptions associated with XSketch estimators ensures a complete statistical model and error analysis for XSketches.
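
The claim that random arrangement within a bucket yields the same formulas as the uniform distribution assumption can be illustrated with a small sketch. It is not taken from the paper: the function names, the single-bucket setup, and the frequencies below are hypothetical, and it covers only a COUNT over a range that partially overlaps one bucket. It compares the usual histogram estimate (bucket total scaled by the covered fraction) with the exact expectation of the true count when the bucket's frequencies are assigned to positions uniformly at random.

```python
from fractions import Fraction
from itertools import permutations


def histogram_estimate(bucket_total, bucket_width, covered):
    """Usual range-count estimate: bucket total times the covered fraction."""
    return Fraction(bucket_total * covered, bucket_width)


def expected_true_count(freqs, covered):
    """Exact expected count over the first `covered` positions of the bucket,
    averaging over every arrangement of the frequencies (random shuffling)."""
    total = Fraction(0)
    arrangements = 0
    for perm in permutations(freqs):
        total += sum(perm[:covered])
        arrangements += 1
    return total / arrangements


# Hypothetical bucket with clearly non-uniform frequencies.
freqs = [9, 1, 4, 2]
covered = 3  # the query range covers 3 of the 4 values in the bucket

estimate = histogram_estimate(sum(freqs), len(freqs), covered)
expected = expected_true_count(freqs, covered)
print(estimate, expected)
```

In this toy case both quantities agree even though no two elements share a frequency, which is the sense in which the weaker assumption reproduces the estimate. The sketch says nothing about the error guarantees or about the comparison with sampling and sketching; those are results of the paper's analysis.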