Fast string matching by using probabilities: on an optimal mismatch variant of Horspool's algorithm

  • Authors:
  • Markus E. Nebel

  • Affiliations:
  • Fachbereich Informatik, Technische Universität Kaiserslautern, Kaiserslautern, Germany

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2006

Quantified Score

Hi-index 5.23

Visualization

Abstract

The string matching problem, i.e. the task of finding all occurrences of one string as a substring of another one, is a fundamental problem in computer science. Recently, this problem received a great deal of attention due to numerous applications in computational biology. In this paper we address a modified version of Horspool's string matching algorithm using the probabilities of the different symbols to speed up the search. We show that the modified algorithm has a linear average running time; a precise asymptotical representation of the running time will be proven. A comparison of the average running time of the modified algorithm with well-known results for the original method shows that a substantial speed up for most of the symbol distributions has been achieved. Finally, we show that the distribution of the symbols can be approximated to a high precision using a random sample of sublinear size.