Consistent selectivity estimation via maximum entropy

Authors:
V. Markl;P. J. Haas;M. Kutsch;N. Megiddo;U. Srivastava;T. M. Tran
Affiliations:
IBM Almaden Research Center, USA;IBM Almaden Research Center, USA;IBM Germany, Boeblingen;IBM Almaden Research Center, USA;Stanford University, USA;IBM Silicon Valley Lab, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2007

Abstract

Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use cumbersome ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Experiments with our prototype implementation in DB2 UDB show that use of the ME approach can improve the optimizer’s cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times. For almost all queries, these improvements are obtained while adding only tens of milliseconds to the overall time required for query optimization.
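
As a rough illustration of the idea described in the abstract, the following Python sketch computes a maximum entropy distribution over the "atoms" (truth assignments) of a few predicates, given known selectivities for some of their conjuncts. It uses iterative proportional fitting, a standard technique for this kind of constrained problem; it is not claimed to be the algorithm used in the DB2 UDB prototype. The function names, the data layout, and the example selectivities are all hypothetical, and the sketch assumes every known selectivity lies strictly between 0 and 1 and that the constraints are mutually consistent.

```python
from itertools import product


def max_entropy_atoms(n, known, iters=200):
    """Sketch: fit a maximum entropy distribution over the 2**n atoms.

    n     -- number of predicates p0 .. p(n-1)
    known -- dict mapping a frozenset of predicate indices to the known
             selectivity of their conjunction (assumed to be in (0, 1))
    Returns a dict mapping each atom (tuple of 0/1 flags) to its probability.
    """
    atoms = list(product((0, 1), repeat=n))
    # With no knowledge at all, the uniform distribution has maximum entropy.
    x = {a: 1.0 / len(atoms) for a in atoms}
    for _ in range(iters):
        for preds, s in known.items():
            # Current probability mass of atoms satisfying every predicate in preds.
            mass = sum(p for a, p in x.items() if all(a[i] for i in preds))
            for a in atoms:
                if all(a[i] for i in preds):
                    x[a] *= s / mass                  # match the known selectivity
                else:
                    x[a] *= (1.0 - s) / (1.0 - mass)  # keep the total mass at 1
    return x


def selectivity(x, preds):
    """Estimated selectivity of the conjunction of the predicates in preds."""
    return sum(p for a, p in x.items() if all(a[i] for i in preds))


# Hypothetical statistics: single-predicate selectivities only.
singles = {frozenset({0}): 0.1, frozenset({1}): 0.2}
print(selectivity(max_entropy_atoms(2, singles), {0, 1}))

# Hypothetical statistics: a joint selectivity for p0 AND p1 is also known.
with_joint = {
    frozenset({0}): 0.1,
    frozenset({1}): 0.2,
    frozenset({2}): 0.25,
    frozenset({0, 1}): 0.05,
}
print(selectivity(max_entropy_atoms(3, with_joint), {0, 1, 2}))
```

When only the single-predicate selectivities are supplied, the estimate for the conjunction is the product 0.1 × 0.2 = 0.02, which matches the abstract's remark that the ME approach reduces to the independence assumption in the absence of detailed knowledge. When the joint selectivity of p0 and p1 is also supplied, that value is honored exactly and independence is assumed only with respect to p2, so every conjunct is estimated from one consistent distribution.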