Efficient fuzzy search in large text collections

  • Authors:
  • Hannah Bast;Marjan Celikik

  • Affiliations:
  • Albert Ludwigs University, Freiburg, Germany;Albert Ludwigs University, Freiburg, Germany

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

We consider the problem of fuzzy full-text search in large text collections, that is, full-text search which is robust against errors both on the side of the query as well as on the side of the documents. Standard inverted-index techniques work extremely well for ordinary full-text search but fail to achieve interactive query times (below 100 milliseconds) for fuzzy full-text search even on moderately-sized text collections (above 10 GBs of text). We present new preprocessing techniques that achieve interactive query times on large text collections (100 GB of text, served by a single machine). We consider two similarity measures, one where the query terms match similar terms in the collection (e.g., algorithm matches algoritm or vice versa) and one where the query terms match terms with a similar prefix in the collection (e.g., alori matches algorithm). The latter is important when we want to display results instantly after each keystroke (search as you type). All algorithms have been fully integrated into the CompleteSearch engine.