Fast Approximate Search in Large Dictionaries

Authors:
Stoyan Mihov;Klaus U. Schulz
Venue:
Computational Linguistics
Year:
2004

Abstract

The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article, we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a "universal Levenshtein automaton," we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure significantly. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
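
To make the candidate-selection problem concrete, the following is a minimal Python sketch of a well-known baseline: it stores the dictionary in a trie and walks it while maintaining one row of the Levenshtein dynamic-programming table, abandoning any branch whose row already exceeds the bound k. This is not the universal Levenshtein automaton or either filtering method described in the article; the names `build_trie`, `candidates`, and the `END` marker are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List

END = "$end"  # hypothetical end-of-word marker; multi-character, so it never clashes with a letter


def build_trie(words: Iterable[str]) -> Dict:
    """Store the dictionary as a nested-dict trie."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def candidates(trie: Dict, pattern: str, k: int) -> List[str]:
    """Return all dictionary words whose Levenshtein distance to pattern is at most k."""
    results: List[str] = []
    n = len(pattern)

    def walk(node: Dict, prefix: str, row: List[int]) -> None:
        # row[j] is the edit distance between prefix and pattern[:j].
        if END in node and row[n] <= k:
            results.append(prefix)
        for ch, child in node.items():
            if ch == END:
                continue
            new_row = [row[0] + 1]
            for j in range(1, n + 1):
                cost = 0 if pattern[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match or substitution
            # Prune: if every entry exceeds k, no extension of this prefix can match.
            if min(new_row) <= k:
                walk(child, prefix + ch, new_row)

    walk(trie, "", list(range(n + 1)))
    return sorted(results)


if __name__ == "__main__":
    trie = build_trie(["hello", "help", "hell", "held", "world", "word"])
    print(candidates(trie, "helo", 1))
```

The pruning step is what keeps the search from visiting the whole dictionary: a branch is dropped as soon as the minimum of its row exceeds k, which mirrors, in a much simpler form, the idea of traversing the dictionary in parallel with a structure that tracks the remaining error budget.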