Consistently estimating the selectivity of conjuncts of predicates

Authors:
V. Markl;N. Megiddo;M. Kutsch;T. M. Tran;P. Haas;U. Srivastava
Affiliations:
IBM Almaden Research Center;IBM Almaden Research Center;IBM Germany;IBM Silicon Valley Lab;IBM Almaden Research Center;Stanford University
Venue:
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Year:
2005

Abstract

Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics (MVS) to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method efficiently exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Our implementation using a prototype version of DB2 UDB shows that ME improves the optimizer's cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times.
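
To make the idea concrete, here is a minimal sketch of maximum-entropy selectivity estimation for a conjunction of n predicates. It is an illustration under stated assumptions, not the authors' DB2 implementation. It enumerates all 2^n truth assignments ("atoms") and fits them to the known selectivities with generic iterative proportional fitting, which may differ from the algorithm used in the paper. The names `max_entropy_atoms` and `selectivity` are hypothetical, every known selectivity is assumed to lie strictly between 0 and 1, and the knowledge set is assumed to be mutually consistent.

```python
from itertools import product


def max_entropy_atoms(n, known, iterations=200):
    """Fit a maximum-entropy distribution over the 2**n atoms of n predicates.

    known maps a frozenset of predicate indices to the known selectivity of
    the conjunction of those predicates (assumed strictly between 0 and 1).
    Starting from the uniform distribution and rescaling to one constraint
    at a time is iterative proportional fitting.
    """
    atoms = list(product((0, 1), repeat=n))
    x = {a: 1.0 / len(atoms) for a in atoms}  # uniform start

    for _ in range(iterations):
        for preds, target in known.items():
            # Current probability mass of atoms satisfying all of preds.
            mass = sum(p for a, p in x.items() if all(a[i] for i in preds))
            for a in atoms:
                if all(a[i] for i in preds):
                    x[a] *= target / mass
                else:
                    x[a] *= (1.0 - target) / (1.0 - mass)
    return x


def selectivity(x, preds):
    """Estimated selectivity of the conjunction of the predicates in preds."""
    return sum(p for a, p in x.items() if all(a[i] for i in preds))


# Only single-predicate selectivities are known: the estimate for the full
# conjunction falls back to the independence assumption.
single = {frozenset({0}): 0.1, frozenset({1}): 0.2, frozenset({2}): 0.25}
independent = selectivity(max_entropy_atoms(3, single), {0, 1, 2})

# A multivariate statistic on predicates 0 and 1 is also known: every
# estimate derived from the fitted atoms is consistent with it.
with_mvs = dict(single)
with_mvs[frozenset({0, 1})] = 0.05
correlated = selectivity(max_entropy_atoms(3, with_mvs), {0, 1, 2})
```

With only the three single-predicate selectivities, `independent` comes out at approximately 0.1 × 0.2 × 0.25 = 0.005, matching the statement that ME reduces to the independence assumption when no further knowledge is available. With the extra joint statistic, `correlated` is approximately 0.05 × 0.25 = 0.0125, and the same atom distribution answers every other conjunct, so all estimates agree with one another. Enumerating atoms is exponential in n, so this sketch is only practical for a handful of predicates.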