Efficient implementation of Aho–Corasick pattern matching automata using Unicode

  • Authors:
  • Janne Nieminen;Pekka Kilpeläinen

  • Affiliations:
  • Department of Computer Science, University of Kuopio, P.O. Box 1627, FI-70211 Kuopio, Finland;Department of Computer Science, University of Kuopio, P.O. Box 1627, FI-70211 Kuopio, Finland

  • Venue:
  • Software—Practice & Experience
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

We study different efficient implementations of an Aho–Corasick pattern matching automaton when searching for patterns in Unicode text. Much of the previous research has been based on the assumption of a relatively small alphabet, for example the 7-bit ASCII. Our aim is to examine the differences in performance arising from the use of a large alphabet, such as Unicode that is widely used today. The main concern is the representation of the transition function of the pattern matching automaton. We examine and compare array, linked list, hashing, balanced tree, perfect hashing, hybrid, triple-array, and double-array representations. For perfect hashing, we present an algorithm that constructs the hash tables in expected linear time and linear space. We implement the Aho–Corasick automaton in Java using the different transition function representations, and we evaluate their performance. Triple-array and double-array performed best in our experiments, with perfect hashing, hashing, and balanced tree coming next. We discovered that the array implementation has a slow preprocessing time when using the Unicode alphabet. It seems that the use of a large alphabet can slow down the preprocessing time of the automaton considerably depending on the transition function representation used. Copyright © 2006 John Wiley & Sons, Ltd.