Histograms revisited: when are histograms the best approximation method for aggregates over joins?

Authors:
Alin Dobra
Affiliations:
University of Florida, Gainesville, FL
Venue:
Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2005

Abstract

The traditional statistical assumption for interpreting histograms and justifying approximate query processing methods based on them is that all elements in a bucket have the same frequency, the so-called uniform distribution assumption. In this paper we show that a significantly less restrictive statistical assumption (the elements within a bucket are randomly arranged even though they might have different frequencies) leads to identical formulae for approximating aggregate queries using histograms. This observation allows us to identify scenarios in which histograms are well suited as approximation methods (in fact, we show that in these situations sampling and sketching are significantly worse) and to provide tight error guarantees for the quality of approximations. At the same time we show that, on average, histograms are rather poor approximators outside these scenarios.
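
As a rough illustration of why the two assumptions can lead to the same formula, the sketch below looks at a single bucket of an equi-join. It is not taken from the paper. It assumes the standard textbook per-bucket estimate F * G / w, where F and G are the total frequencies of the two relations in the bucket and w is the number of distinct values the bucket covers. The function names and the sample frequencies are hypothetical. The code compares that estimate with the exact join size averaged over every possible arrangement of one relation's frequencies within the bucket.

```python
from fractions import Fraction
from itertools import permutations


def bucket_join_estimate(f_total, g_total, width):
    """Per-bucket estimate under the uniform assumption: F * G / w.

    Assumed textbook formula: each of the `width` values is taken to have
    frequency F/w in one relation and G/w in the other.
    """
    return Fraction(f_total * g_total, width)


def average_join_size_over_arrangements(f, g):
    """Exact join size, averaged over every arrangement of g in the bucket.

    This models the weaker assumption: frequencies may differ, but their
    placement inside the bucket is random.
    """
    sizes = [
        sum(fi * gi for fi, gi in zip(f, arrangement))
        for arrangement in permutations(g)
    ]
    return Fraction(sum(sizes), len(sizes))


# Hypothetical, clearly non-uniform frequencies for a bucket of four values.
f = [5, 1, 0, 2]
g = [3, 0, 4, 1]

estimate = bucket_join_estimate(sum(f), sum(g), len(f))
average = average_join_size_over_arrangements(f, g)
actual = sum(fi * gi for fi, gi in zip(f, g))

print("uniform-assumption estimate:", estimate)
print("average over arrangements:  ", average)
print("join size for this one arrangement:", actual)
```

The first two numbers agree even though neither frequency list is uniform, while the join size for any single arrangement can differ from both. The enumeration is factorial in the bucket width, so it is only practical for toy buckets like this one.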