Faster pattern matching with character classes using prime number encoding

  • Authors:
  • Chaim Linhart;Ron Shamir

  • Affiliations:
  • School of Computer Science, Tel Aviv University, Tel-Aviv 69978, Israel;School of Computer Science, Tel Aviv University, Tel-Aviv 69978, Israel

  • Venue:
  • Journal of Computer and System Sciences
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

In pattern matching with character classes the goal is to find all occurrences of a pattern of length m in a text of length n, where each pattern position consists of an allowed set of characters from a finite alphabet @S. We present an FFT-based algorithm that uses a novel prime-numbers encoding scheme, which is logn/logm times faster than the fastest extant approaches, which are based on boolean convolutions. In particular, if m^|^@S^|=n^O^(^1^), our algorithm runs in time O(nlogm), matching the complexity of the fastest techniques for wildcard matching, a special case of our problem. A major advantage of our algorithm is that it allows a tradeoff between the running time and the RAM word size. Our algorithm also speeds up solutions to approximate matching with character classes problems-namely, matching with k mismatches and Hamming distance, as well as to the subset matching problem.