Chinese string searching using the KMP algorithm

Authors:
Robert W. P. Luk
Affiliations:
Hong Kong Polytechnic University, Kowloon, Hong Kong
Venue:
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Year:
1996

Abstract

This paper describes a modification of the Knuth-Morris-Pratt (KMP) algorithm for string searching in Chinese text. The difficulty lies in searching a text string that mixes single- and multi-byte characters. We show that the input must be decoded properly as a sequence of characters rather than a sequence of bytes. The standard KMP algorithm can easily be modified for Chinese string searching, but with a worst-case time complexity of O(3n) in terms of the number of comparisons. The finite-automaton implementation achieves a worst-case time complexity of O(2n), but the cost of constructing its transition table depends on the size of the alphabet Σ, which is large for Chinese (for Big-5, |Σ| is roughly 13,000). A mapping technique reduces the size of the alphabet to at most |P|, where P is the pattern string.
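
The need to decode characters before matching can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a simplified Big-5-style encoding in which any byte of value 0x80 or above starts a two-byte character and every other byte is a one-byte character. The helper names split_chars, failure_table and kmp_search are hypothetical and chosen for this illustration only. The search itself is the textbook failure-function form of KMP, applied to a list of characters instead of raw bytes.

```python
def split_chars(data: bytes) -> list[bytes]:
    """Split a byte string into characters.

    Assumption (simplified): a byte >= 0x80 is the lead byte of a
    two-byte character; any other byte is a one-byte character.
    """
    chars = []
    i = 0
    while i < len(data):
        # A dangling lead byte at the very end is kept as a one-byte unit.
        width = 2 if data[i] >= 0x80 and i + 1 < len(data) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def failure_table(pattern: list[bytes]) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it (the standard KMP failure function)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: list[bytes], pattern: list[bytes]) -> int:
    """Return the character index of the first match, or -1 if none."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


# A two-byte character followed by the one-byte character "@" (0x40).
text = b"\xa5\xa4@"
# A single two-byte character whose bytes are 0xA4 0x40.
pattern = b"\xa4\x40"

byte_level = text.find(pattern)  # matches across a character boundary
char_level = kmp_search(split_chars(text), split_chars(pattern))
```

In this example the byte-level search reports a match at byte offset 1, because the trail byte of the first character and the following "@" happen to form the pattern's byte sequence. The character-level search reports no match, which is the correct answer, since the pattern character does not occur in the text.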