Selectivity estimation for Boolean queries

Authors:
Zhiyuan Chen;Nick Koudas;Flip Korn;S. Muthukrishnan
Affiliations:
AT&T Labs, Cornell University;AT&T Labs-Research;AT&T Labs-Research;AT&T Labs-Research
Venue:
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2000

References: 15
Cited by: 25

Information retrieval: data structures and algorithms

Information retrieval: data structures and algorithms
Estimating alphanumeric selectivity in the presence of wildcards

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Fast and effective query refinement

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Substring selectivity estimation

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
Relevance ranking for one to three term queries

Information Processing and Management: an International Journal
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Selectivity Estimation in the Presence of Alphanumeric Correlations

ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
Optimal Histograms with Quality Guarantees

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Multi-Dimensional Substring Selectivity Estimation

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Finding Interesting Associations without Support Pruning

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Design and Evaluation of Incremental Data Structures and Algorithms for Dynamic Query Interfaces

INFOVIS '97 Proceedings of the 1997 IEEE Symposium on Information Visualization (InfoVis '97)

Efficient and tunable similar set retrieval

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Algorithmics and applications of tree and graph searching

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Mining database structure; or, how to build a data quality browser

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Using histograms to estimate answer sizes for XML queries

Information Systems - Special issue: Best papers from EDBT 2002
Estimating Answer Sizes for XML Queries

EDBT '02 Proceedings of the 8th International Conference on Extending Database Technology: Advances in Database Technology
Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Proceedings of the 27th International Conference on Very Large Data Bases
Generalized substring selectivity estimation

Journal of Computer and System Sciences - Special issue on PODS 2000
Processing set expressions over continuous update streams

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Fast computation of database operations using graphics processors

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Tracking set-expression cardinalities over continuous update streams

The VLDB Journal — The International Journal on Very Large Data Bases
Selectivity estimation for fuzzy string predicates in large data sets

VLDB '05 Proceedings of the 31st international conference on Very large data bases
CXHist: an on-line classification-based histogram for XML string selectivity estimation

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Pruning subscriptions in distributed publish/subscribe systems

ACSC '06 Proceedings of the 29th Australasian Computer Science Conference - Volume 48
Fast computation of database operations using graphics processors

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
SEPIA: estimating selectivities of approximate string predicates in large Databases

The VLDB Journal — The International Journal on Very Large Data Bases
Optimized union of non-disjoint distributed data sets

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Approximate substring selectivity estimation

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Power-law based estimation of set similarity join size

Proceedings of the VLDB Endowment
Result-size estimation for information-retrieval subqueries

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A decomposition-based probabilistic framework for estimating the selectivity of XML twig queries

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Efficient top-k algorithms for approximate substring matching

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Abstract

In a variety of applications ranging from optimizing queries on alphanumeric attributes to providing approximate counts of documents containing several query terms, there is an increasing need to quickly and reliably estimate the number of strings (tuples, documents, etc.) matching a Boolean query. Boolean queries in this context consist of substring predicates composed using Boolean operators. While there has been some work in estimating the selectivity of substring queries, the more general problem of estimating the selectivity of Boolean queries over substring predicates has not been studied.

Our approach is to extract selectivity estimates from relationships between the substring predicates of the Boolean query. However, storing the correlation between all possible predicates in order to provide an exact answer to such predicates is clearly infeasible, as there is a super-exponential number of possible combinations of these predicates. Instead, our novel idea is to capture correlations in a space-efficient but approximate manner. We employ a Monte Carlo technique called set hashing to succinctly represent the set of strings containing a given substring as a signature vector of hash values. Correlations among substring predicates can then be generated on-the-fly by operating on these signatures.

We formalize our approach and propose an algorithm for estimating the selectivity of any Boolean query using the signatures of its substring predicates. We then experimentally demonstrate the superiority of our approach over a straight-forward approach based on the independence assumption wherein correlations are not explicitly captured.
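
The set-hashing idea described in the abstract can be illustrated with a minimal min-hash sketch. This is not the paper's algorithm: it handles only a conjunction of two predicates, and every name in it (`NUM_HASHES`, `signature`, `resemblance`, `estimate_and`), the signature length, and the synthetic id sets are assumptions made for illustration. Each set stands in for the ids of strings containing one substring. The fraction of signature positions on which two signatures agree approximates the resemblance r = |A ∩ B| / |A ∪ B|, from which |A ∩ B| = r (|A| + |B|) / (1 + r) follows.

```python
import hashlib

NUM_HASHES = 200  # signature length; an illustrative choice, not taken from the paper


def _hash(seed: int, item: int) -> int:
    """Deterministic 64-bit hash of (seed, item), standing in for a random permutation."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def signature(ids):
    """Min-hash signature: for each hash function, the minimum hash value over the set."""
    return [min(_hash(seed, i) for i in ids) for seed in range(NUM_HASHES)]


def resemblance(sig_a, sig_b):
    """Fraction of agreeing positions; estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def estimate_and(sig_a, size_a, sig_b, size_b):
    """Estimate |A ∩ B| from the resemblance r via r * (|A| + |B|) / (1 + r)."""
    r = resemblance(sig_a, sig_b)
    return r * (size_a + size_b) / (1 + r)


# Hypothetical data: ids of strings matching two strongly correlated substrings
# in a collection of 10,000 strings. The true intersection has 400 ids.
total_strings = 10_000
ids_a = set(range(0, 500))
ids_b = set(range(100, 600))

sig_a, sig_b = signature(ids_a), signature(ids_b)
hashed_estimate = estimate_and(sig_a, len(ids_a), sig_b, len(ids_b))

# Baseline that ignores correlation: |A| * |B| / N.
independence_estimate = len(ids_a) * len(ids_b) / total_strings

print("true:", len(ids_a & ids_b))
print("set hashing:", round(hashed_estimate))
print("independence:", independence_estimate)
```

On this synthetic input the independence baseline gives 25 against a true count of 400, while the signature-based estimate depends only on the two fixed-length signatures and the two set sizes. Longer signatures reduce the variance of the estimate at the cost of more space per substring.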