We consider the following spelling variants clustering problem: given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words that are spelling variants of each other. This problem arises naturally in error-tolerant full-text search of the following kind: for a given query, return not only the documents that match the query words exactly but also those that match their spelling variants. This is the inverse of the well-known "Did you mean: … ?" web search engine feature, where the error tolerance is on the side of the query and not on the side of the documents. We combine various ideas from the large body of literature on approximate string searching and spelling correction into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.
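To make the problem statement concrete, the following is a minimal sketch of a naive quadratic baseline. It is not the algorithm announced above. It assumes that two words are spelling variants when their Levenshtein distance is at most a threshold; the abstract does not define that criterion, and the names `edit_distance`, `cluster_variants`, and `max_dist` are hypothetical.

```python
from typing import Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cluster_variants(lexicon: List[str], max_dist: int = 1) -> Dict[str, Set[str]]:
    """Map each word to the other lexicon words within max_dist edits.

    Assumption: "spelling variant" means Levenshtein distance <= max_dist.
    This compares all pairs, so it only illustrates the problem and would
    not scale to a lexicon of millions of words.
    """
    clusters: Dict[str, Set[str]] = {w: set() for w in lexicon}
    for i, w in enumerate(lexicon):
        for v in lexicon[i + 1:]:
            # A length gap larger than max_dist already rules the pair out.
            if abs(len(w) - len(v)) > max_dist:
                continue
            if edit_distance(w, v) <= max_dist:
                clusters[w].add(v)
                clusters[v].add(w)
    return clusters


lexicon = ["algorithm", "algoritm", "algorithms", "search", "serach"]
clusters = cluster_variants(lexicon, max_dist=1)
```

In this toy lexicon the cluster of "algorithm" contains both "algoritm" and "algorithms", while those two words are not variants of each other at threshold 1, which is why the clusters may overlap. A plain Levenshtein threshold of 1 also misses the transposition in "serach", so a practical similarity criterion would need to be richer than this sketch.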