Efficient top-k algorithms for fuzzy search in string collections

  • Authors: Rares Vernica; Chen Li

  • Affiliations: University of California, Irvine, CA; University of California, Irvine, CA

  • Venue: Proceedings of the First International Workshop on Keyword Search on Structured Data

  • Year: 2009

Abstract

An approximate search query on a collection of strings finds the strings in the collection that are similar to a given query string, where similarity is defined by a given similarity function such as Jaccard similarity, cosine similarity, or edit distance. Answering approximate queries efficiently is important in many applications, such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as in the data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. These algorithms answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and terminating early. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.
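
The progressive idea described in the abstract (answering a ranking query through a sequence of approximate range queries) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the padded 2-gram tokenization, the Jaccard measure, the linear scoring function with parameter `alpha`, and the fixed threshold schedule are all assumptions made for illustration; the only property the stopping rule relies on is that the score is monotone in both similarity and importance.

```python
import heapq
from collections import defaultdict


def grams(s, q=2):
    """Set of q-grams of s, padded at both ends (padding symbols are an assumption)."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two gram sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def build_index(strings, q=2):
    """Gram-based inverted lists: gram -> ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g].append(sid)
    return index


def range_query(query, strings, index, threshold, q=2):
    """Return {id: similarity} for strings with Jaccard similarity >= threshold.

    Only strings sharing at least one gram with the query are candidates,
    so strings with similarity 0 are never returned.
    """
    qg = grams(query, q)
    candidates = {sid for g in qg for sid in index.get(g, ())}
    hits = {}
    for sid in candidates:
        sim = jaccard(qg, grams(strings[sid], q))
        if sim >= threshold:
            hits[sid] = sim
    return hits


def score(sim, weight, alpha=0.5):
    """Hypothetical scoring function, monotone in similarity and importance."""
    return alpha * sim + (1 - alpha) * weight


def top_k(query, strings, weights, index, k,
          thresholds=(1.0, 0.75, 0.5, 0.25, 0.0), q=2, alpha=0.5):
    """Progressively relax the range-query threshold until the top k is safe."""
    max_w = max(weights)
    ranked = []
    for t in thresholds:
        hits = range_query(query, strings, index, t, q)
        ranked = heapq.nlargest(
            k, ((score(sim, weights[sid], alpha), sid) for sid, sim in hits.items())
        )
        # A string missing from hits has similarity < t and weight <= max_w,
        # so its score is at most score(t, max_w): stop once the k-th best
        # answer found so far reaches that bound.
        if len(ranked) == k and ranked[-1][0] >= score(t, max_w, alpha):
            break
    return [(strings[sid], sc) for sc, sid in ranked]


strings = ["apple", "apply", "ape", "maple", "banana"]
weights = [0.9, 0.2, 0.5, 0.7, 1.0]  # made-up importance values in [0, 1]
index = build_index(strings)
result = top_k("aple", strings, weights, index, k=2)
```

Each round re-runs the range query from scratch, which keeps the sketch short but repeats work; the list-traversal algorithms mentioned in the abstract instead prune and skip string ids directly on the inverted lists. The threshold schedule trades the number of rounds against the size of each candidate set.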