Fast string matching by using probabilities: on an optimal mismatch variant of Horspool's algorithm

Authors: Markus E. Nebel
Affiliations: Fachbereich Informatik, Technische Universität Kaiserslautern, Kaiserslautern, Germany
Venue: Theoretical Computer Science
Year: 2006

Abstract

The string matching problem, i.e., the task of finding all occurrences of one string as a substring of another, is a fundamental problem in computer science. Recently, this problem has received a great deal of attention due to numerous applications in computational biology. In this paper we address a modified version of Horspool's string matching algorithm that uses the probabilities of the different symbols to speed up the search. We show that the modified algorithm has a linear average running time and prove a precise asymptotic representation of that running time. A comparison of the average running time of the modified algorithm with well-known results for the original method shows a substantial speedup for most symbol distributions. Finally, we show that the distribution of the symbols can be approximated to high precision using a random sample of sublinear size.
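
To illustrate the idea summarized in the abstract, the following is a minimal Python sketch, not the paper's exact formulation. It assumes that the "optimal mismatch" variant compares the pattern positions in order of increasing symbol probability, so that a mismatch is likely to be found early, while keeping Horspool's usual shift rule. The function names `estimate_probabilities` and `horspool_optimal_mismatch`, the `sample_size` parameter, and the fixed seed are hypothetical choices made for this example.

```python
import random
from collections import Counter


def estimate_probabilities(text, sample_size, seed=0):
    """Estimate symbol probabilities from a random sample of text positions.

    Assumption: uniform sampling with replacement; the sample may be much
    smaller than the text.
    """
    if not text or sample_size <= 0:
        return {}
    rng = random.Random(seed)
    sample = [text[rng.randrange(len(text))] for _ in range(sample_size)]
    counts = Counter(sample)
    return {symbol: count / sample_size for symbol, count in counts.items()}


def horspool_optimal_mismatch(pattern, text, probabilities):
    """Return the start indices of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Horspool shift table: for each symbol in pattern[:-1], the distance
    # from its rightmost occurrence to the end of the pattern.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i

    # Comparison order: rarest pattern symbols first. Symbols missing from
    # the estimate are treated as probability 0.0 (an assumption).
    order = sorted(range(m), key=lambda i: probabilities.get(pattern[i], 0.0))

    matches = []
    pos = 0
    while pos <= n - m:
        # all() stops at the first mismatching position in the chosen order.
        if all(text[pos + i] == pattern[i] for i in order):
            matches.append(pos)
        # Shift by the symbol aligned with the last pattern position.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

A possible usage: estimate the distribution once from a small sample, for instance `probs = estimate_probabilities(text, 1000)`, then call `horspool_optimal_mismatch(pattern, text, probs)` for each pattern. The comparison order only affects how quickly a mismatch is detected; the set of reported occurrences is the same for any order.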