On-line Approximate String Matching in Natural Language

Authors:
Kimmo Fredriksson
Affiliations:
Department of Computer Science, University of Joensuu, PO Box 111, 80101 Joensuu, Finland. E-mail: kfredrik@cs.joensuu.fi
Venue:
Fundamenta Informaticae
Year:
2006


Abstract

We consider approximate pattern matching in natural language text. We use the words of the text as the alphabet, instead of the characters as in traditional string matching approaches. Hence our pattern consists of a sequence of words. From the algorithmic point of view this has several advantages: (i) the number of words is much smaller than the number of characters, which in effect means a shorter text (fewer possible matching positions); (ii) the pattern effectively becomes shorter, so bit-parallel techniques become more applicable; (iii) the alphabet size becomes much larger, so the probability that two symbols (in this case, words) match is reduced. We extend several known approximate string matching algorithms to this scenario, allowing k insertions, deletions or substitutions of symbols (natural language words). We further extend the algorithms to allow k' errors inside the pattern symbols (words) as well. The two error thresholds k and k' can be applied simultaneously and independently. Hence we have in effect two alphabets, and perform approximate matching at both levels. From the application point of view the advantage is that the method is flexible, allowing simple solutions to problems that are hard to solve with traditional approaches. Finally, we extend the algorithms to handle multiple patterns at the same time. Depending on the search parameters, we obtain algorithms that run in linear or sublinear time and that perform the optimal number of word comparisons on average. We conclude with experimental results showing that the methods work well in practice.
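The two-level error model described in the abstract can be illustrated with a small sketch. This is not one of the bit-parallel or filtering algorithms the paper extends; it is a plain dynamic programming search over words, in which two words count as equal when their character-level edit distance is at most k'. The function names `edit_distance` and `word_level_search`, the whitespace tokenization, and the sample text are assumptions made for illustration only.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences (assumed helper)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def word_level_search(pattern, text, k, k_prime):
    """Return indices of text words at which an occurrence of the pattern
    ends with at most k word-level errors, where two words are treated as
    matching if they differ by at most k_prime character-level errors."""
    p = pattern.split()  # assumption: words are whitespace-separated
    t = text.split()
    m = len(p)
    col = list(range(m + 1))  # DP column for the empty text prefix
    ends = []
    for j, text_word in enumerate(t):
        new = [0]  # row 0 stays 0: an occurrence may start at any text word
        for i in range(1, m + 1):
            same = edit_distance(p[i - 1], text_word) <= k_prime
            new.append(min(col[i - 1] + (0 if same else 1),  # word match/substitution
                           col[i] + 1,                       # extra word in the text
                           new[i - 1] + 1))                  # pattern word missing
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


text = "the quick brown fox jumps over the lazy dog"
print(word_level_search("quick brwn fox", text, k=0, k_prime=1))  # [3]
print(word_level_search("quick brwn fox", text, k=0, k_prime=0))  # []
```

With k = 0 and k' = 1 the misspelled word "brwn" is accepted as a match for "brown", so the pattern is found ending at word index 3 ("fox"). With both thresholds at zero the same query finds nothing, which shows that the two thresholds act independently. The sketch compares every pattern word against every text word, so it does not reflect the linear or sublinear running times reported for the paper's algorithms.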