Generalized substring selectivity estimation

Authors:
Zhiyuan Chen;Flip Korn;Nick Koudas;S. Muthukrishnan
Affiliations:
Department of Computer Science, 4110 Upson Hall, Cornell University, Ithaca, NY;AT&T Labs-Research, 180 Park Ave., Florham Park, NJ;AT&T Labs-Research, 180 Park Ave., Florham Park, NJ;AT&T Labs-Research, 180 Park Ave., Florham Park, NJ
Venue:
Journal of Computer and System Sciences - Special issue on PODS 2000
Year:
2003

Citing 23
Cited 3

Equi-depth multidimensional histograms

SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
Optimal histograms for limiting worst-case error propagation in the size of join results

ACM Transactions on Database Systems (TODS)
Estimating alphanumeric selectivity in the presence of wildcards

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Improved histograms for selectivity estimation of range predicates

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
A guide to the SQL standard (4th ed.): a user's guide to the standard database language SQL

A guide to the SQL standard (4th ed.): a user's guide to the standard database language SQL
Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Fast and effective query refinement

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Substring selectivity estimation

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
Selectivity estimation for Boolean queries

PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On effective multi-dimensional indexing for strings

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Relevance ranking for one to three term queries

Information Processing and Management: an International Journal
Modern Information Retrieval

Modern Information Retrieval
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Selectivity Estimation in the Presence of Alphanumeric Correlations

ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
Multi-Dimensional Substring Selectivity Estimation

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Selectivity Estimation Without the Attribute Value Independence Assumption

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Finding Interesting Associations without Support Pruning

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Deflating the Dimensionality Curse Using Multiple Fractal Dimensions

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Design and Evaluation of Incremental Data Structures and Algorithms for Dynamic Query Interfaces

INFOVIS '97 Proceedings of the 1997 IEEE Symposium on Information Visualization (InfoVis '97)

Estimating the selectivity of approximate string queries

ACM Transactions on Database Systems (TODS)
Exploitation of semantic relationships and hierarchical data structures to support a user in his annotation and browsing activities in folksonomies

Information Systems
A query expansion and user profile enrichment approach to improve the performance of recommender systems operating on a folksonomy

User Modeling and User-Adapted Interaction

Abstract

In a variety of settings from relational databases to LDAP to Web applications, there is an increasing need to quickly and accurately estimate the count of tuples (LDAP entries, Web documents, etc.) matching Boolean substring queries. In providing such selectivity estimates, the correlation between different occurrences of substrings is crucial. Selectivity estimation for generalized Boolean queries has not been studied previously; our own prior work, which is discussed and extended herein, applies to the case of one-dimensional Boolean queries [CKKM00]. Existing methods for the case of multidimensional conjunctive queries approximate selectivities by explicitly storing cross-counts of frequently co-occurring combinations of substrings; estimates are obtained by parsing the query into multidimensional substrings corresponding to stored cross-counts and applying probabilistic formulae. The major problem with these methods is that the number of cross-counts stored by known methods increases exponentially with the number of dimensions (a "space dimensionality explosion") due to the need to capture the correlation amongst the dimensions. Hence, given a limited amount of space, none of the existing methods can reliably give accurate estimates. Moreover, these methods do not generalize gracefully to Boolean queries. We present a novel approach to selectivity estimation for generalized Boolean substring queries with a focus on the two cases of (1) conjunctive multidimensional and (2) Boolean queries. Our approach does not explicitly store cross-counts, but rather generates them on the fly. We employ a Monte Carlo technique called set hashing to succinctly represent the set of tuples containing a given substring as a signature vector of hash values; any combination of set hash signatures gives a cross-count when intersected. Thus, using only linear storage, a large number of cross-counts can be generated including those for complex co-occurrences of substrings.
The cross-counts generated by our methods are not exact, but they are adequate for selectivity estimation. We present results from an extensive experimental evaluation of our approach on real data sets. For the case of multidimensional conjunctive queries, our approach achieves better accuracy by an order of magnitude, and scales much more gracefully to higher dimensions, than existing methods. Surprisingly, even though our approach involves generating cross-counts on the fly, estimation is very fast, taking 200 µs on a data set of size 6 MB. For the case of Boolean queries, our experiments also demonstrate the superiority of this approach over a straightforward independence-based approach wherein correlations are not captured.
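
The idea of deriving cross-counts from signatures, as the abstract describes it, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes set hashing is realized as min-wise hashing with k seeded hash functions, and the names `signature`, `resemblance`, `intersection_size`, the BLAKE2b-based hash, the choice k = 512, and the toy tuple-id sets are all hypothetical. Each substring keeps only its signature and its own count; the cross-count of two substrings is estimated from the fraction of matching signature entries, and compared against an independence-based estimate.

```python
import hashlib

def _hash(seed: int, tuple_id: int) -> int:
    """Deterministic 64-bit hash of a tuple id under a given seed (assumed hash family)."""
    data = f"{seed}:{tuple_id}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(tuple_ids, k=512):
    """Signature vector: for each of k hash functions, the minimum hash over the set."""
    return [min(_hash(seed, t) for t in tuple_ids) for seed in range(k)]

def resemblance(sig_a, sig_b):
    """Fraction of matching entries; estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def intersection_size(sig_a, size_a, sig_b, size_b):
    """Cross-count estimate from resemblance r: |A ∩ B| = r (|A| + |B|) / (1 + r)."""
    r = resemblance(sig_a, sig_b)
    return r * (size_a + size_b) / (1 + r)

# Toy data (hypothetical): ids of the tuples that contain each substring.
n_tuples = 1000
tuples_abc = set(range(0, 400))    # tuples containing "abc"
tuples_xyz = set(range(200, 600))  # tuples containing "xyz"

sig_abc = signature(tuples_abc)
sig_xyz = signature(tuples_xyz)

true_count = len(tuples_abc & tuples_xyz)
estimate = intersection_size(sig_abc, len(tuples_abc), sig_xyz, len(tuples_xyz))
# Independence-based estimate: treats the two substrings as uncorrelated.
independent = n_tuples * (len(tuples_abc) / n_tuples) * (len(tuples_xyz) / n_tuples)

print(f"true={true_count} set-hash={estimate:.1f} independence={independent:.1f}")
```

Storage in this sketch is one fixed-length signature per substring, so it grows linearly with the number of substrings kept, while any pair of signatures can be combined at query time. The estimate is approximate, and its error shrinks as k grows.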