Effective Indices for Efficient Approximate String Search and Similarity Join

Authors:
Xuhui Liu;Guoliang Li;Jianhua Feng;Lizhu Zhou
Affiliations:
-;-;-;-
Venue:
WAIM '08 Proceedings of the 2008 The Ninth International Conference on Web-Age Information Management
Year:
2008

Citing 0
Cited 2

Simple and efficient algorithm for approximate dictionary matching

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Efficient similarity search in very large string sets

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Data collections often have inconsistencies that arise due to a variety of reasons, and it is desirable to be able to identify and resolve them efficiently. Similarity queries are commonly used in data cleaning for matching similar data. In this work we concentrate on the following problem of approximate string matching based on edit distance: from a collection of strings, how to find those strings similar to a given string, or the strings in another collection of strings with similarity greater than some threshold? We propose an NFA-based (Nondeterministic Finitestate Automation) method for effective approximate string search. We model strings as a trie and construct an NFA on top of the trie. We identify the similar strings by running the NFA based on the tree automata theory. Moreover, we propose grouped trie to further improve the performance of similarity search by incorporating some effective pruning techniques. We have implemented our method and the experimental results show that our approach achieves high performance and out performs the existing state-of-the-art methods by orders of magnitude.