Estimating alphanumeric selectivity in the presence of wildcards
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology
Algorithms on strings, trees, and sequences: computer science and computational biology
q-gram based database searching using a suffix array (QUASAR)
RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
Substring selectivity estimation
PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A guided tour to approximate string matching
ACM Computing Surveys (CSUR)
Fast and flexible string matching by combining bit-parallelism and suffix automata
Journal of Experimental Algorithmics (JEA)
Modern Information Retrieval
Automatic Speech and Speaker Recognition
Automatic Speech and Speaker Recognition
Estimating the Selectivity of XML Path Expressions for Internet Scale Applications
Proceedings of the 27th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free
Proceedings of the 27th International Conference on Very Large Data Bases
Fast Algorithms for Mining Association Rules in Large Databases
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
One-Gapped q-Gram Filtersfor Levenshtein Distance
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Indexing text data under space constraints
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Selectivity estimation for fuzzy string predicates in large data sets
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Indexing mixed types for approximate retrieval
VLDB '05 Proceedings of the 31st international conference on Very large data bases
XPathLearner: an on-line self-tuning Markov histogram for XML path selectivity estimation
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Hashed samples: selectivity estimators for set similarity selection queries
Proceedings of the VLDB Endowment
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints
Proceedings of the VLDB Endowment
Approximate substring selectivity estimation
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Efficient approximate entity extraction with edit distance constraints
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Efficient approximate search on string collections
Proceedings of the VLDB Endowment
Power-law based estimation of set similarity join size
Proceedings of the VLDB Endowment
Probabilistic string similarity joins
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Simple and efficient algorithm for approximate dictionary matching
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient fuzzy full-text type-ahead search
The VLDB Journal — The International Journal on Very Large Data Bases
Pass-join: a partition-based method for similarity joins
Proceedings of the VLDB Endowment
Supporting efficient top-k queries in type-ahead search
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Selectivity estimation for hybrid queries over text-rich data graphs
Proceedings of the 16th International Conference on Extending Database Technology
PartSS: an efficient partition-based filtering for edit distance constraints
ADC '11 Proceedings of the Twenty-Second Australasian Database Conference - Volume 115
Efficient top-k algorithms for approximate substring matching
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
A partition-based method for string similarity joins with edit-distance constraints
ACM Transactions on Database Systems (TODS)
Extending string similarity join to tolerant fuzzy token matching
ACM Transactions on Database Systems (TODS)
Hi-index | 0.00 |
There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semi-lattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm Opt EQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing Opt EQ with the state-of-the-art method SEPIA. Our experimental results show that Opt EQ delivers more accurate selectivity estimations.