Global optimization of histograms

Authors:
H. V. Jagadish;Hui Jin;Beng Chin Ooi;Kian-Lee Tan
Affiliations:
Department of Electrical Engineering and Computer Science, University of Michigan;Department of Computer Science, National University of Singapore;Department of Computer Science, National University of Singapore;Department of Computer Science, National University of Singapore
Venue:
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Year:
2001



Abstract

Histograms are frequently used to represent the distribution of data values in an attribute of a relation. Most previous work has focused on identifying the optimal histogram (given a limited number of buckets) for a single attribute, independent of other attributes and histograms. In this paper, we propose the idea of global optimization of histograms, i.e., single-attribute histograms for a set of attributes are optimized collectively so as to minimize the overall error in using the histograms. The idea is to allocate more buckets to histograms whose attributes are more frequently used and/or whose distributions are highly skewed. While the accuracy of some histograms is penalized (because they are assigned fewer buckets), we expect the global error to be low compared to the traditional method of allocating an equal number of buckets to each histogram.

We propose two algorithms to determine the histograms to construct for a collection of attributes. The first is based on dynamic programming, and the second is a greedy algorithm. We compare the overall error of these algorithms against the traditional method. Extensive experiments are conducted and the results confirm the benefits of globally optimal histograms in reducing the overall error. The extent of improvement depends on the data and query distributions, ranging from no benefit when there is no significant difference in the data distributions to over a factor of 100 reduction in error in some cases we tried.

The time to compute the globally optimal histograms using dynamic programming is much longer than the time to compute optimal histograms separately for each attribute, and the difference widens at a faster rate as the number of histograms increases. With the greedy algorithm, the time penalty is small, but the error reduction is somewhat less as well. We propose a third algorithm, called greedy algorithm with remedy, that has a running time similar to that of the greedy algorithm but produces results close to the global optimum. In fact, in every experiment that we tried, this algorithm found the exact global optimum.
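
The abstract does not give the algorithms in detail, so the following is only a minimal sketch of the bucket-allocation idea under stated assumptions, not the authors' implementation. It assumes that the error of each single-attribute histogram is the sum of squared errors (SSE) of a V-optimal histogram, that the global error is a usage-weighted sum of the per-attribute errors, and that every attribute receives at least one bucket. The function names (`voptimal_errors`, `allocate_dp`, `allocate_greedy`, `total_error`) and the toy frequency vectors are hypothetical. The "greedy algorithm with remedy" is not sketched because the abstract does not say what the remedy step is.

```python
from typing import List, Sequence, Tuple


def voptimal_errors(freqs: Sequence[float], max_buckets: int) -> List[float]:
    """errors[b - 1] = smallest SSE of any histogram of `freqs` with at most b contiguous buckets."""
    n = len(freqs)
    ps, pss = [0.0], [0.0]  # prefix sums of values and of squared values
    for f in freqs:
        ps.append(ps[-1] + f)
        pss.append(pss[-1] + f * f)

    def sse(i: int, j: int) -> float:
        # SSE of freqs[i:j] around its mean; clamped to guard against rounding
        s = ps[j] - ps[i]
        return max(0.0, (pss[j] - pss[i]) - s * s / (j - i))

    prev = [0.0] + [sse(0, j) for j in range(1, n + 1)]  # one bucket over freqs[:j]
    errors = [prev[n]]
    for b in range(2, max_buckets + 1):
        cur = [0.0] * (n + 1)  # prefixes with at most b values have zero error
        for j in range(b + 1, n + 1):
            # the last bucket is freqs[i:j]; the first i values use b - 1 buckets
            cur[j] = min(prev[i] + sse(i, j) for i in range(b - 1, j))
        prev = cur
        errors.append(prev[n])
    return errors


def total_error(tables, weights, alloc) -> float:
    """Weighted global error of an allocation (assumed objective)."""
    return sum(w * t[b - 1] for t, w, b in zip(tables, weights, alloc))


def allocate_dp(tables, weights, budget: int) -> Tuple[float, List[int]]:
    """Exact allocation by dynamic programming over (attribute, buckets used)."""
    m = len(tables)
    best = {0: (0.0, [])}  # buckets used so far -> (cost, allocation)
    for i in range(m):
        nxt = {}
        for used, (cost, alloc) in best.items():
            max_b = budget - used - (m - i - 1)  # keep one bucket per remaining attribute
            for b in range(1, max_b + 1):
                c = cost + weights[i] * tables[i][b - 1]
                if used + b not in nxt or c < nxt[used + b][0]:
                    nxt[used + b] = (c, alloc + [b])
        best = nxt
    return best[budget]


def allocate_greedy(tables, weights, budget: int) -> List[int]:
    """Give each spare bucket to the attribute with the largest weighted error reduction."""
    m = len(tables)
    alloc = [1] * m
    for _ in range(budget - m):
        gains = [weights[i] * (tables[i][alloc[i] - 1] - tables[i][alloc[i]])
                 for i in range(m)]
        alloc[max(range(m), key=gains.__getitem__)] += 1
    return alloc


# Toy input (hypothetical): one highly skewed and one nearly flat attribute.
skewed = [100, 2, 3, 1, 90, 2, 1, 80]
flat = [5, 5, 6, 5, 5, 6, 5, 5]
budget = 8
weights = [1.0, 1.0]  # assumed usage frequencies of the two attributes
tables = [voptimal_errors(f, budget) for f in (skewed, flat)]

dp_cost, dp_alloc = allocate_dp(tables, weights, budget)
greedy_alloc = allocate_greedy(tables, weights, budget)
print("equal :", [4, 4], total_error(tables, weights, [4, 4]))
print("greedy:", greedy_alloc, total_error(tables, weights, greedy_alloc))
print("dp    :", dp_alloc, dp_cost)
```

In this sketch the dynamic program examines every split of the budget and is therefore exact for the assumed objective, while the greedy pass makes one locally best choice per spare bucket. On the toy input, both give the skewed attribute more buckets than the flat one, which is the behavior the abstract describes.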