Structure choices for two-dimensional histogram construction

Authors:
Hang T. A. Pham;Kenneth C. Sevcik
Affiliations:
Department of Computer Science, University of Toronto;Department of Computer Science, University of Toronto
Venue:
CASCON '04 Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research
Year:
2004

Abstract

Histograms of the distributions of individual attributes are currently used in leading database management systems (e.g., IBM DB2, Oracle Database, and Microsoft SQL Server). Because attribute pairs in databases are seldom independent, however, combining the distributions of individual attributes under the attribute independence assumption often leads to poor estimates. More accurate answers can be obtained by using multi-dimensional histograms to characterize the joint distribution of two or more attributes. When moving from one-dimensional to two-dimensional histograms, several new issues relating to histogram structure arise: (1) which attribute should take priority over the other with respect to data partitioning; (2) into how many partitions each dimension should be split to obtain a desired number of histogram buckets; and (3) how many of the most frequent values should be isolated and stored in singleton buckets. In the context of real data, we experimentally show that our proposed methods for dealing with histogram structure choices lead to good-quality histograms for a variety of histogram partitioning techniques and various types of data distributions.
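
The effect of the attribute independence assumption can be illustrated with a small sketch. The code below is not the authors' method: it assumes simple equi-width partitioning, a hypothetical toy table of 100 perfectly correlated (x, y) pairs, and hypothetical helper names (`bucket_index`, `build_1d`, `build_2d`). It compares an estimate built from two one-dimensional histograms with one read from a 2 x 2 two-dimensional histogram for the query `x < 50 AND y >= 50`.

```python
def bucket_index(value, low, high, parts):
    """Map a value in [low, high) to one of `parts` equi-width partitions."""
    width = (high - low) / parts
    return min(int((value - low) / width), parts - 1)


def build_1d(values, low, high, parts):
    """Equi-width one-dimensional histogram: one count per partition."""
    counts = [0] * parts
    for v in values:
        counts[bucket_index(v, low, high, parts)] += 1
    return counts


def build_2d(rows, low, high, parts_x, parts_y):
    """Equi-width two-dimensional histogram: a parts_x by parts_y grid of counts."""
    counts = [[0] * parts_y for _ in range(parts_x)]
    for x, y in rows:
        i = bucket_index(x, low, high, parts_x)
        j = bucket_index(y, low, high, parts_y)
        counts[i][j] += 1
    return counts


# Assumed toy data: y always equals x, so the attributes are fully dependent.
rows = [(v, v) for v in range(100)]
n = len(rows)

hist_x = build_1d([x for x, _ in rows], 0, 100, 2)
hist_y = build_1d([y for _, y in rows], 0, 100, 2)
hist_xy = build_2d(rows, 0, 100, 2, 2)

# Query: x < 50 AND y >= 50 (lower half of x, upper half of y).
true_count = sum(1 for x, y in rows if x < 50 and y >= 50)

# Independence assumption: multiply the two single-attribute selectivities.
independent_estimate = (hist_x[0] / n) * (hist_y[1] / n) * n

# Joint histogram: read the count of the matching bucket directly.
joint_estimate = hist_xy[0][1]

print(true_count, independent_estimate, joint_estimate)
```

In this toy case the query boundaries coincide with the bucket boundaries, so the two-dimensional histogram is exact, while the independence-based estimate predicts a quarter of the table. With real data, how each dimension is partitioned and how many buckets are available determine how much of this advantage is retained, which is where the three structure choices listed in the abstract come in.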