We study efficient implementations of the Aho–Corasick pattern matching automaton for searching patterns in Unicode text. Much of the previous research assumes a relatively small alphabet, for example 7-bit ASCII. Our aim is to examine the differences in performance that arise from a large alphabet such as Unicode, which is widely used today. The main concern is the representation of the transition function of the pattern matching automaton. We examine and compare array, linked list, hashing, balanced tree, perfect hashing, hybrid, triple-array, and double-array representations. For perfect hashing, we present an algorithm that constructs the hash tables in expected linear time and linear space. We implement the Aho–Corasick automaton in Java using each of these transition function representations and evaluate their performance. Triple-array and double-array performed best in our experiments, followed by perfect hashing, hashing, and balanced tree. We found that the array implementation has a slow preprocessing time with the Unicode alphabet. It appears that a large alphabet can slow the preprocessing of the automaton considerably, depending on the transition function representation used. Copyright © 2006 John Wiley & Sons, Ltd.
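To make the role of the transition function concrete, the following is a minimal sketch of an Aho–Corasick automaton whose goto function is stored as one hash map per state, which corresponds to the plain hashing representation in the list above. It is not the implementation evaluated in the paper: the class name `AhoCorasickSketch`, its methods, and the `"start:pattern"` output format are assumptions made for illustration. Transitions are keyed by Java `char` values (UTF-16 code units), so a character outside the Basic Multilingual Plane would take two transitions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: Aho-Corasick with a hash-map transition function. */
class AhoCorasickSketch {
    // Goto function: one hash map per state, keyed by a UTF-16 code unit.
    private final List<Map<Character, Integer>> next = new ArrayList<>();
    // Failure link of each state (state 0 is the root).
    private final List<Integer> fail = new ArrayList<>();
    // Patterns recognized on entering each state.
    private final List<List<String>> out = new ArrayList<>();

    AhoCorasickSketch(List<String> patterns) {
        newState(); // root
        // Phase 1: build the trie of all patterns.
        for (String p : patterns) {
            int s = 0;
            for (char c : p.toCharArray()) {
                Integer t = next.get(s).get(c);
                if (t == null) {
                    t = newState();
                    next.get(s).put(c, t);
                }
                s = t;
            }
            out.get(s).add(p);
        }
        // Phase 2: breadth-first traversal to set failure links.
        // Children of the root keep the default failure link 0.
        ArrayDeque<Integer> queue = new ArrayDeque<>(next.get(0).values());
        while (!queue.isEmpty()) {
            int s = queue.remove();
            for (Map.Entry<Character, Integer> e : next.get(s).entrySet()) {
                int t = e.getValue();
                int f = step(fail.get(s), e.getKey());
                fail.set(t, f);
                out.get(t).addAll(out.get(f)); // inherit matches ending at f
                queue.add(t);
            }
        }
    }

    private int newState() {
        next.add(new HashMap<>());
        fail.add(0);
        out.add(new ArrayList<>());
        return next.size() - 1;
    }

    // Follow failure links until a transition on c exists or the root is reached.
    private int step(int state, char c) {
        while (state != 0 && !next.get(state).containsKey(c)) {
            state = fail.get(state);
        }
        return next.get(state).getOrDefault(c, 0);
    }

    /** Returns every match as "startIndex:pattern", in order of end position. */
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            state = step(state, text.charAt(i));
            for (String p : out.get(state)) {
                hits.add((i - p.length() + 1) + ":" + p);
            }
        }
        return hits;
    }
}
```

In this sketch, memory grows with the number of transitions actually present rather than with the alphabet size. A plain array representation would instead reserve one slot per possible symbol in every state, which is one plausible reason for the slow array preprocessing reported above with Unicode. Swapping the `Map<Character, Integer>` for another structure (a balanced tree, a perfect hash table, a double-array) changes only how `step` looks up a transition.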