Data collections often contain inconsistencies that arise for a variety of reasons, and it is desirable to identify and resolve them efficiently. Similarity queries are commonly used in data cleaning to match similar data. In this work we concentrate on the following problem of approximate string matching based on edit distance: given a collection of strings, how can we find the strings that are similar to a given query string, or the strings in another collection whose similarity exceeds some threshold? We propose an NFA-based (nondeterministic finite automaton) method for efficient approximate string search. We model the strings as a trie and construct an NFA on top of the trie. We identify the similar strings by running the NFA according to tree automata theory. Moreover, we propose a grouped trie that further improves the performance of similarity search by incorporating effective pruning techniques. We have implemented our method, and the experimental results show that our approach achieves high performance and outperforms existing state-of-the-art methods by orders of magnitude.
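To make the problem setting concrete, the following is a minimal sketch of edit-distance search over a trie. It is not the NFA construction or the grouped trie proposed in this work. It uses the well-known technique of carrying one row of the edit-distance table down each trie path, which tracks the same information as the active states of an edit-distance automaton. The names `TrieNode`, `build_trie`, and `search`, and the threshold parameter `k`, are assumptions made for this illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class TrieNode:
    """One trie node; `word` is set only on nodes that end a stored string."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every string into a character trie and return the root."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return (string, distance) pairs with edit distance to `query` at most k."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: distance to query[:j] is j insertions.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= k:
        results.append((root.word, first_row[-1]))

    # Depth-first traversal; each entry carries the row of its parent prefix.
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,      # insertion
                           prev[j] + 1,         # deletion
                           prev[j - 1] + cost)) # match or substitution
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Prune: if every cell exceeds k, no extension of this prefix can match.
        if min(row) <= k:
            for c, child in node.children.items():
                stack.append((child, c, row))
    return sorted(results)


if __name__ == "__main__":
    trie = build_trie(["kitten", "sitting", "mitten", "bitten", "kitchen", "sit"])
    print(search(trie, "kitten", 1))
```

Because strings that share a prefix share the rows computed for that prefix, and a subtree is abandoned as soon as its row minimum exceeds `k`, the work grows with the part of the trie that stays within the threshold rather than with the whole collection.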