Extending q-grams to estimate selectivity of string matching with low edit distance

Authors:
Hongrae Lee;Raymond T. Ng;Kyuseok Shim
Affiliations:
Univ. of British Columbia;Univ. of British Columbia;Seoul National Univ.
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Citing 19
Cited 18

Estimating alphanumeric selectivity in the presence of wildcards

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
q-gram based database searching using a suffix array (QUASAR)

RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
Substring selectivity estimation

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Fast and flexible string matching by combining bit-parallelism and suffix automata

Journal of Experimental Algorithmics (JEA)
Modern Information Retrieval

Modern Information Retrieval
Automatic Speech and Speaker Recognition

Automatic Speech and Speaker Recognition
Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
One-Gapped q-Gram Filtersfor Levenshtein Distance

CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Indexing text data under space constraints

Proceedings of the thirteenth ACM international conference on Information and knowledge management
Selectivity estimation for fuzzy string predicates in large data sets

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Indexing mixed types for approximate retrieval

VLDB '05 Proceedings of the 31st international conference on Very large data bases
XPathLearner: an on-line self-tuning Markov histogram for XML path selectivity estimation

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Hashed samples: selectivity estimators for set similarity selection queries

Proceedings of the VLDB Endowment
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Approximate substring selectivity estimation

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Efficient approximate entity extraction with edit distance constraints

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Efficient approximate search on string collections

Proceedings of the VLDB Endowment
Power-law based estimation of set similarity join size

Proceedings of the VLDB Endowment
Probabilistic string similarity joins

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Simple and efficient algorithm for approximate dictionary matching

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient fuzzy full-text type-ahead search

The VLDB Journal — The International Journal on Very Large Data Bases
Pass-join: a partition-based method for similarity joins

Proceedings of the VLDB Endowment
Supporting efficient top-k queries in type-ahead search

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Selectivity estimation for hybrid queries over text-rich data graphs

Proceedings of the 16th International Conference on Extending Database Technology
PartSS: an efficient partition-based filtering for edit distance constraints

ADC '11 Proceedings of the Twenty-Second Australasian Database Conference - Volume 115
Efficient top-k algorithms for approximate substring matching

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems (TODS)
Extending string similarity join to tolerant fuzzy token matching

ACM Transactions on Database Systems (TODS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semi-lattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm Opt EQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing Opt EQ with the state-of-the-art method SEPIA. Our experimental results show that Opt EQ delivers more accurate selectivity estimations.