Efficient implementation of Aho–Corasick pattern matching automata using Unicode

Authors:
Janne Nieminen; Pekka Kilpeläinen
Affiliations:
Department of Computer Science, University of Kuopio, P.O. Box 1627, FI-70211 Kuopio, Finland;Department of Computer Science, University of Kuopio, P.O. Box 1627, FI-70211 Kuopio, Finland
Venue:
Software—Practice & Experience
Year:
2007

Abstract

We study efficient implementations of the Aho–Corasick pattern matching automaton for searching for patterns in Unicode text. Much of the previous research has assumed a relatively small alphabet, for example 7-bit ASCII. Our aim is to examine the differences in performance that arise from using a large alphabet such as Unicode, which is widely used today. The main concern is the representation of the transition function of the pattern matching automaton. We examine and compare array, linked list, hashing, balanced tree, perfect hashing, hybrid, triple-array, and double-array representations. For perfect hashing, we present an algorithm that constructs the hash tables in expected linear time and linear space. We implement the Aho–Corasick automaton in Java using the different transition function representations and evaluate their performance. Triple-array and double-array performed best in our experiments, followed by perfect hashing, hashing, and balanced tree. We found that the array implementation has a slow preprocessing time when using the Unicode alphabet. It seems that a large alphabet can slow down the preprocessing of the automaton considerably, depending on the transition function representation used. Copyright © 2006 John Wiley & Sons, Ltd.
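
To make the role of the transition function concrete, the following is a minimal Java sketch of an Aho–Corasick automaton that uses the plain hashing representation (one hash map per state). It is an illustration under stated assumptions, not the authors' implementation: the class name AhoCorasickSketch, its method names, and the "start:pattern" result format are all hypothetical, and transitions are keyed by Java char values (UTF-16 code units), so a supplementary character would take two transitions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not taken from the paper. */
class AhoCorasickSketch {
    // Transition (goto) function: one hash map per state, keyed by UTF-16 code unit.
    private final List<Map<Character, Integer>> gotoFn = new ArrayList<>();
    // Failure function: state to fall back to when no transition exists.
    private final List<Integer> fail = new ArrayList<>();
    // Output function: patterns that end at each state.
    private final List<List<String>> output = new ArrayList<>();

    AhoCorasickSketch(List<String> patterns) {
        newState(); // state 0 is the root
        for (String p : patterns) {
            int s = 0;
            for (char c : p.toCharArray()) {
                Integer next = gotoFn.get(s).get(c);
                if (next == null) {
                    next = newState();
                    gotoFn.get(s).put(c, next);
                }
                s = next;
            }
            output.get(s).add(p);
        }
        buildFailureLinks();
    }

    private int newState() {
        gotoFn.add(new HashMap<>());
        fail.add(0);
        output.add(new ArrayList<>());
        return gotoFn.size() - 1;
    }

    // Breadth-first traversal; depth-1 states keep the default failure link 0.
    private void buildFailureLinks() {
        ArrayDeque<Integer> queue = new ArrayDeque<>(gotoFn.get(0).values());
        while (!queue.isEmpty()) {
            int s = queue.poll();
            for (Map.Entry<Character, Integer> e : gotoFn.get(s).entrySet()) {
                int child = e.getValue();
                int f = step(fail.get(s), e.getKey());
                fail.set(child, f);
                // Inherit the patterns recognized at the failure state.
                output.get(child).addAll(output.get(f));
                queue.add(child);
            }
        }
    }

    // Follows failure links until a transition on c exists or the root is reached.
    private int step(int state, char c) {
        while (true) {
            Integer next = gotoFn.get(state).get(c);
            if (next != null) return next;
            if (state == 0) return 0;
            state = fail.get(state);
        }
    }

    /** Returns matches as "start:pattern", ordered by end position in the text. */
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        int s = 0;
        for (int i = 0; i < text.length(); i++) {
            s = step(s, text.charAt(i));
            for (String p : output.get(s)) {
                hits.add((i - p.length() + 1) + ":" + p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac = new AhoCorasickSketch(Arrays.asList("he", "she", "his", "hers"));
        System.out.println(ac.search("ushers")); // [1:she, 2:he, 2:hers]
    }
}
```

In this sketch only the gotoFn field and the lookups in step and the constructor depend on the chosen representation. Swapping the per-state HashMap for a full array indexed by character would give the array representation, whose per-state cost grows with the alphabet size; that is the kind of trade-off the abstract describes for Unicode.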