Fast error-tolerant search on very large texts

Authors:
Marjan Celikik; Holger Bast
Affiliations:
Max Planck Institute for Computer Science, Saarbrücken, Germany; Max Planck Institute for Computer Science, Saarbrücken, Germany
Venue:
Proceedings of the 2009 ACM symposium on Applied Computing
Year:
2009

Abstract

We consider the following spelling variants clustering problem: given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words that are spelling variants of each other. This problem arises naturally in error-tolerant full-text search of the following kind: for a given query, return not only the documents that match the query words exactly but also those that match their spelling variants. This is the inverse of the well-known "Did you mean: … ?" feature of web search engines, where the error tolerance is on the side of the query rather than on the side of the documents. We combine various ideas from the large body of literature on approximate string searching and spelling correction into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.
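To make the problem statement concrete, here is a minimal Python sketch of spelling variants clustering and of document-side query expansion. It is a naive quadratic baseline, not the algorithm of the paper: it assumes that two words count as spelling variants when their Levenshtein distance is at most a threshold, which the abstract does not specify. The names `edit_distance`, `spelling_variant_clusters`, `expand_query`, and the parameter `max_dist` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def spelling_variant_clusters(lexicon, max_dist=1):
    """Map each word to the set of lexicon words within max_dist edits.

    Assumption: "spelling variant" means small edit distance. Clusters may
    overlap, as in the problem statement. This compares all pairs, so it is
    only practical for small lexicons.
    """
    words = sorted(set(lexicon))
    clusters = {w: {w} for w in words}
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            # A length difference above max_dist already rules the pair out.
            if abs(len(u) - len(v)) > max_dist:
                continue
            if edit_distance(u, v) <= max_dist:
                clusters[u].add(v)
                clusters[v].add(u)
    return clusters


def expand_query(query_words, clusters):
    """Replace each query word by its sorted cluster of spelling variants."""
    return [sorted(clusters.get(w, {w})) for w in query_words]


lexicon = ["algorithm", "algoritm", "algorithms", "search", "serch"]
clusters = spelling_variant_clusters(lexicon)
expanded = expand_query(["serch", "algorithm"], clusters)
```

In this toy lexicon the cluster of `algorithm` contains both `algoritm` and `algorithms`, while those two words are not in each other's clusters, which illustrates why the clusters are allowed to overlap. A search engine would then match a document if it contains any word from the expanded list for each query word.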