Efficient fuzzy search in large text collections

Authors:
Hannah Bast; Marjan Celikik
Affiliations:
Albert Ludwigs University, Freiburg, Germany;Albert Ludwigs University, Freiburg, Germany
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2013


Abstract

We consider the problem of fuzzy full-text search in large text collections, that is, full-text search that is robust against errors in both the query and the documents. Standard inverted-index techniques work extremely well for ordinary full-text search but fail to achieve interactive query times (below 100 milliseconds) for fuzzy full-text search, even on moderately sized text collections (above 10 GB of text). We present new preprocessing techniques that achieve interactive query times on large text collections (100 GB of text, served by a single machine). We consider two similarity measures: one where the query terms match similar terms in the collection (e.g., algorithm matches algoritm or vice versa), and one where the query terms match terms with a similar prefix in the collection (e.g., alori matches algorithm). The latter is important when we want to display results instantly after each keystroke (search as you type). All algorithms have been fully integrated into the CompleteSearch engine.
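
To make the two similarity measures concrete, here is a minimal sketch in Python. It assumes that similarity is measured by Levenshtein (edit) distance, which the abstract does not state explicitly. The function names and the error threshold of 1 are hypothetical and chosen only for illustration. The sketch compares one query against one term by brute force; it does not reproduce the preprocessing techniques of the paper.

```python
def edit_distance_row(query, term):
    """Return the last dynamic-programming row of the Levenshtein table.

    row[j] is the edit distance between the whole query and term[:j].
    """
    # Distance from the empty query to term[:j] is j insertions.
    prev = list(range(len(term) + 1))
    for i, qc in enumerate(query, start=1):
        cur = [i]  # distance from query[:i] to the empty term prefix
        for j, tc in enumerate(term, start=1):
            cur.append(min(
                prev[j] + 1,               # delete qc from the query
                cur[j - 1] + 1,            # insert tc into the query
                prev[j - 1] + (qc != tc),  # match or substitute
            ))
        prev = cur
    return prev


def word_distance(query, term):
    """Edit distance between the query and the complete term."""
    return edit_distance_row(query, term)[-1]


def prefix_distance(query, term):
    """Smallest edit distance between the query and any prefix of the term."""
    return min(edit_distance_row(query, term))


# Assumption: one error is tolerated; the abstract does not give a threshold.
MAX_ERRORS = 1


def is_word_match(query, term):
    return word_distance(query, term) <= MAX_ERRORS


def is_prefix_match(query, term):
    return prefix_distance(query, term) <= MAX_ERRORS


if __name__ == "__main__":
    print(is_word_match("algorithm", "algoritm"))  # whole-word similarity
    print(is_prefix_match("alori", "algorithm"))   # prefix similarity
    print(is_word_match("alori", "algorithm"))     # not similar as a whole word
```

Under these assumptions, the query alori is within one edit of the prefix algori of algorithm, so it matches under the prefix measure even though it is far from the complete word. This is the behavior needed for search as you type.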