A wide range of applications need to query a large text database for similar strings or substrings. Traditional approximate substring matching requires the user to specify a similarity threshold. Without top-k approximate substring matching, users must repeatedly try different maximum distance thresholds when the proper threshold is not known in advance. In this paper, we first propose efficient algorithms for finding the top-k approximate substring matches for a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms use novel filtering techniques that take advantage of q-grams and available inverted q-gram indexes. We conduct extensive experiments with real-life data sets. Our experimental results confirm the effectiveness and scalability of the proposed algorithms.
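To make the idea of q-gram filtering for top-k search concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes edit distance as the distance measure and uses only the classic count filter (a match within edit distance d must preserve at least |query| - q + 1 - q·d of the query's q-grams) to derive a lower bound per data string. All function names and the sample data are hypothetical and chosen for illustration.

```python
import heapq
from collections import defaultdict


def qgrams(s, q):
    """Return all length-q substrings of s (empty list if len(s) < q)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Inverted index: q-gram -> set of ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index


def substring_edit_distance(query, text):
    """Smallest edit distance between query and any substring of text."""
    # Row 0 is all zeros, so a match may start anywhere in text at no cost.
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(query, start=1):
        cur = [i]
        for j, tc in enumerate(text, start=1):
            cur.append(min(prev[j] + 1,                 # skip a query character
                           cur[j - 1] + 1,              # skip a text character
                           prev[j - 1] + (qc != tc)))   # match or substitute
        prev = cur
    # Taking the minimum of the last row lets the match end anywhere in text.
    return min(prev)


def top_k_substring_matches(query, strings, k, q=2):
    """Return ([(distance, string_id), ...], number_of_distance_computations)."""
    if k <= 0:
        return [], 0
    index = build_index(strings, q)  # assumption: in practice built once and reused
    grams = qgrams(query, q)

    # For each string, count the query q-gram positions whose gram occurs in it.
    shared = defaultdict(int)
    for g in grams:
        for sid in index.get(g, ()):
            shared[sid] += 1

    def lower_bound(sid):
        # Count filter: each edit operation destroys at most q query q-grams,
        # so the distance is at least ceil(missing / q).
        missing = len(grams) - shared.get(sid, 0)
        return -(-missing // q)

    heap = []      # max-heap of the k best results, stored as (-distance, id)
    computed = 0
    for sid in sorted(range(len(strings)), key=lower_bound):
        # Stop once no remaining string can beat the current k-th best distance.
        if len(heap) == k and lower_bound(sid) >= -heap[0][0]:
            break
        d = substring_edit_distance(query, strings[sid])
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, sid))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, sid))
    return sorted((-nd, sid) for nd, sid in heap), computed


if __name__ == "__main__":
    data = ["approximate string matching", "selectivity estimation",
            "inverted index", "approximte substring search"]
    print(top_k_substring_matches("approximate", data, k=2))
```

Because candidates are visited in order of increasing lower bound, the loop can stop as soon as the next bound reaches the current k-th best distance, so the user never has to supply a distance threshold. In the sample run only two of the four strings require a full distance computation.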