Practical selectivity estimation through adaptive sampling

Authors:
Richard J. Lipton;Jeffrey F. Naughton;Donovan A. Schneider
Affiliations:
Department of Computer Science, Princeton University;Department of Computer Sciences, University of Wisconsin;Department of Computer Sciences, University of Wisconsin
Venue:
SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Year:
1990

Abstract

Recently we proposed an adaptive random sampling algorithm for general query size estimation. In earlier work we analyzed the asymptotic efficiency and accuracy of the algorithm; in this paper we investigate its practicality as applied to selects and joins. First, we extend our previous analysis to provide significantly improved bounds on the amount of sampling necessary for a given level of accuracy. Next, we provide “sanity bounds” to deal with queries for which the underlying data is extremely skewed or the query result is very small. Finally, we report on the performance of the estimation algorithm as implemented in a host language on a commercial relational system. The results are encouraging, even with this loose coupling between the estimation algorithm and the DBMS.
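
The general idea of adaptive sampling with a sanity bound, applied to a selection, can be illustrated with a minimal sketch. This is not the paper's algorithm or its bounds: the function name, the `hit_target` stopping threshold, and the `max_samples` cap are hypothetical values chosen for illustration, and the relation is assumed to fit in memory as a Python sequence.

```python
import random
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def adaptive_selectivity_estimate(
    relation: Sequence[T],
    predicate: Callable[[T], bool],
    hit_target: int = 100,      # assumed threshold on qualifying samples
    max_samples: int = 10_000,  # assumed cap standing in for a "sanity bound"
    rng: Optional[random.Random] = None,
) -> Tuple[float, int]:
    """Return (estimated number of qualifying tuples, samples drawn).

    Tuples are sampled uniformly with replacement. Sampling stops once
    enough qualifying tuples have been seen, or once the cap is reached,
    which covers very small or empty results.
    """
    if not relation:
        return 0.0, 0
    rng = rng or random.Random()
    hits = 0
    samples = 0
    while hits < hit_target and samples < max_samples:
        samples += 1
        if predicate(rng.choice(relation)):
            hits += 1
    # Scale the observed hit fraction up to the size of the relation.
    return len(relation) * hits / samples, samples


if __name__ == "__main__":
    data = list(range(1000))
    estimate, drawn = adaptive_selectivity_estimate(
        data, lambda x: x < 100, rng=random.Random(0)
    )
    print(f"estimated size: {estimate:.1f} after {drawn} samples")
```

The amount of sampling adapts to the data: a highly selective predicate needs more draws to reach `hit_target`, while the cap keeps the cost bounded when almost nothing qualifies.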