One-dimensional and multi-dimensional substring selectivity estimation

  • Authors:
  • H. V. Jagadish; Olga Kapitskaia; Raymond T. Ng; Divesh Srivastava

  • Affiliations:
  • University of Michigan, Ann Arbor / E-mail: jag@umich.edu; Pôle Universitaire Léonard de Vinci / E-mail: Olga.Kapitskaia@devinci.fr; University of British Columbia / E-mail: rng@cs.ubc.ca; AT&T Labs – Research, 180 Park Avenue, Bldg 103, Florham Park, NJ 07932, USA / E-mail: divesh@research.att.com

  • Venue:
  • The VLDB Journal — The International Journal on Very Large Data Bases
  • Year:
  • 2000


Abstract

With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.
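
The abstract only outlines the approach; as a rough illustration of how a pruned count-suffix tree and a maximal-overlap style estimate could fit together, here is a small Python sketch. It is an assumption-laden approximation, not the paper's algorithm: a flat dictionary of substring counts stands in for the suffix-tree structure, and the greedy parse and overlap correction follow one plausible reading of "maximal overlap". The names build_pruned_count_suffix_tree and mo_estimate are hypothetical.

```python
from collections import defaultdict


def build_pruned_count_suffix_tree(strings, prune_threshold):
    """Stand-in for a pruned count-suffix tree (PST): count, for every
    substring, how many strings contain it, then prune rare entries."""
    counts = defaultdict(int)
    for s in strings:
        subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        for sub in subs:
            counts[sub] += 1
    pst = {sub: c for sub, c in counts.items() if c >= prune_threshold}
    return pst, len(strings)


def mo_estimate(query, pst, total):
    """Maximal-overlap style selectivity estimate: cover the query with
    maximal substrings retained in the PST, multiply their selectivities,
    and divide out the selectivity of each consecutive overlap."""
    if not query:
        return 1.0  # empty pattern matches everything
    # Longest retained substring starting at each position of the query.
    spans = []
    for i in range(len(query)):
        for j in range(len(query), i, -1):
            if query[i:j] in pst:
                spans.append((i, j))
                break
        else:
            return 0.0  # some character of the query was pruned entirely
    # Keep only the maximal spans (those not contained in an earlier one).
    maximal = [spans[0]]
    for i, j in spans[1:]:
        if j > maximal[-1][1]:
            maximal.append((i, j))
    # First piece contributes count(piece)/total; each later piece
    # contributes count(piece)/count(overlap with the previous piece).
    est = pst[query[maximal[0][0]:maximal[0][1]]] / total
    for (_, prev_end), (cur_start, cur_end) in zip(maximal, maximal[1:]):
        overlap = query[cur_start:prev_end]
        denom = pst[overlap] if overlap else total
        est *= pst[query[cur_start:cur_end]] / denom
    return est


if __name__ == "__main__":
    data = ["johnson", "johansson", "jonson", "jackson", "johns"]
    pst, n = build_pruned_count_suffix_tree(data, prune_threshold=2)
    # "hnson" is covered by the retained pieces "hns" and "nson",
    # which overlap on "ns"; the overlap count is divided out.
    print(mo_estimate("hnson", pst, n))  # estimated fraction of matching strings
```

The flat dictionary trades the memory efficiency of a real count-suffix tree for brevity; the paper's PST stores the same counts along tree paths and additionally supports the MOC/MOLC refinements and the k-dimensional generalization, which are beyond this sketch.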