The VC-dimension of SQL queries and selectivity estimation through sampling

Authors:
Matteo Riondato;Mert Akdere;Uǧur Çetintemel;Stanley B. Zdonik;Eli Upfal
Affiliations:
Department of Computer Science, Brown University, Providence, RI;Department of Computer Science, Brown University, Providence, RI;Department of Computer Science, Brown University, Providence, RI;Department of Computer Science, Brown University, Providence, RI;Department of Computer Science, Brown University, Providence, RI
Venue:
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Year:
2011

Citing 47
Cited 0

Practical selectivity estimation through adaptive sampling

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Error-constrained COUNT query evaluation in relational databases

SIGMOD '91 Proceedings of the 1991 ACM SIGMOD international conference on Management of data
Sequential sampling procedures for query size estimation

SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Fixed-precision estimation of join selectivity

PODS '93 Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
On the relative cost of sampling for join selectivity estimation

PODS '94 Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Query size estimation by adaptive sampling

Selected papers of the 9th annual ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Balancing histogram optimality and practicality for query result size estimation

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Bifocal sampling for skew-resistant join size estimation

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Improved histograms for selectivity estimation of range predicates

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Selectivity and cost estimation for joins based on random sampling

Journal of Computer and System Sciences
New sampling-based summary statistics for improving approximate query answers

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Random sampling for histogram construction: how much is enough?

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Wavelet-based histograms for selectivity estimation

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
On random sampling over joins

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Join synopses for approximate query answering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Statistical estimators for relational algebra expressions

Proceedings of the seventh ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
STHoles: a multidimensional workload-aware histogram

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Applying the golden rule of sampling for query estimation

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Selectivity estimation using probabilistic models

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Improved bounds on the sample complexity of learning

Journal of Computer and System Sciences
The discrepancy method: randomness and complexity

The discrepancy method: randomness and complexity
Database Systems: The Complete Book

Database Systems: The Complete Book
Lectures on Discrete Geometry

Lectures on Discrete Geometry
Selectivity Estimation in the Presence of Alphanumeric Correlations

ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
Selectivity Estimation Using Homogeneity Measurement

Proceedings of the Sixth International Conference on Data Engineering
Sampling-Based Selectivity Estimation for Joins Using Augmented Frequent Value Statistics

ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
Optimal Histograms with Quality Guarantees

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
ICICLES: Self-Tuning Samples for Approximate Query Answering

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Selectivity Estimation Without the Attribute Value Independence Assumption

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Query Size Estimation Using Machine Learning

Proceedings of the Fifth International Conference on Database Systems for Advanced Applications (DASFAA)
Dynamic sample selection for approximate query processing

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A multi-dimensional histogram for selectivity estimation and fast approximate query answering

CASCON '03 Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research
Query Size Estimation for Joins Using Systematic Sampling

Distributed and Parallel Databases
Conditional selectivity for statistics on query expressions

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Histograms revisited: when are histograms the best approximation method for aggregates over joins?

Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Synopses for query optimization: A space-complexity perspective

ACM Transactions on Database Systems (TODS) - Special Issue: SIGMOD/PODS 2004
Techniques for Warehousing of Sample Data

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
New Sampling-Based Estimators for OLAP Queries

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
End-biased Samples for Join Cardinality Estimation

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
ISOMER: Consistent Histogram Construction Using Query Feedback

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
A dip in the reservoir: maintaining sample synopses of evolving datasets

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Consistent selectivity estimation via maximum entropy

The VLDB Journal — The International Journal on Very Large Data Bases
Optimized stratified sampling for approximate query processing

ACM Transactions on Database Systems (TODS)
Cardinality estimation using sample views with quality assurance

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Robust Stratified Sampling Plans for Low Selectivity Queries

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Understanding cardinality estimation using entropy maximization

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Relative (p,ε)-Approximations in Geometry

Discrete & Computational Geometry

Quantified Score

Hi-index	0.00

Visualization

Abstract

We develop a novel method, based on the statistical concept of VC-dimension, for evaluating the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound on the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection predicate, and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size of the database. We develop a method based on this result: given a class of queries, it constructs a concise random sample of a database, such that with high probability the execution of any query in the class on the sample provides an accurate estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the sample needs to be refreshed only following major changes in the database. The sample representation computed by our method is typically sufficiently small to be stored in main memory. We present extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft SQL Server.