Effective Indices for Efficient Approximate String Search and Similarity Join

  • Authors:
  • Xuhui Liu;Guoliang Li;Jianhua Feng;Lizhu Zhou

  • Venue:
  • WAIM '08 Proceedings of the 2008 The Ninth International Conference on Web-Age Information Management
  • Year:
  • 2008

Abstract

Data collections often contain inconsistencies that arise for a variety of reasons, and it is desirable to identify and resolve them efficiently. Similarity queries are commonly used in data cleaning to match similar data. In this work we concentrate on the following problem of approximate string matching based on edit distance: given a collection of strings, how can we find the strings similar to a given query string, or the strings in another collection whose similarity exceeds some threshold? We propose an NFA-based (Nondeterministic Finite-state Automaton) method for effective approximate string search. We model the strings as a trie and construct an NFA on top of the trie. We identify the similar strings by running the NFA, based on tree automata theory. Moreover, we propose a grouped trie that further improves the performance of similarity search by incorporating effective pruning techniques. We have implemented our method, and the experimental results show that our approach achieves high performance and outperforms the existing state-of-the-art methods by orders of magnitude.
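The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of searching a trie under an edit-distance threshold. It is not the paper's NFA construction or its grouped trie. It uses the familiar technique of carrying one dynamic-programming row per trie node, where the entries of a row that are at most k play roughly the role of the active states of an edit-distance automaton. The names TrieNode, build_trie, search, and similarity_join, and the threshold parameter k, are assumptions introduced for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One trie node; `word` is set when a stored string ends here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(words: List[str]) -> TrieNode:
    """Insert every string into a trie so shared prefixes are stored once."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return sorted (string, distance) pairs with edit distance <= k."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: turning "" into query[:j] costs j insertions.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= k:
        results.append((root.word, first_row[-1]))
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # Extend the prefix by `ch` and compute the next Levenshtein row.
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Prune: if every entry exceeds k, no extension of this prefix can match.
        if min(row) <= k:
            for c, child in node.children.items():
                stack.append((child, c, row))
    return sorted(results)


def similarity_join(left: List[str], right: List[str], k: int) -> List[Tuple[str, str, int]]:
    """Naive join sketch: index `right` once, then probe with each string of `left`."""
    root = build_trie(right)
    return [(s, t, d) for s in left for t, d in search(root, s, k)]
```

Because strings with a common prefix share the rows computed for that prefix, and a whole subtree is skipped once its row has no entry within the threshold, the search usually visits far fewer nodes than comparing the query against every string. For instance, search(build_trie(["data", "date", "dart"]), "data", 1) returns the pairs for "data" (distance 0) and "date" (distance 1).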