A language independent n-gram model for word segmentation

  • Authors:
  • Seung-Shik Kang; Kyu-Baek Hwang

  • Affiliations:
  • Department of Computer Science, Kookmin University, Seoul, Korea; School of Computing, Soongsil University, Seoul, Korea

  • Venue:
  • AI'06: Proceedings of the 19th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
  • Year:
  • 2006

Abstract

Word segmentation is an essential first step in the processing of Far East Asian languages (i.e., Chinese, Japanese, and Korean), and it heavily influences subsequent processes such as morphological analysis and parsing. One popular method for this task is to learn segmentation patterns, e.g., n-gram features, from corpus data with space-tags attached. However, it is not straightforward to learn reliable patterns, because typical datasets are sparse. Moreover, the coverage and accuracy of learned patterns vary with many factors, such as the value of n, the dataset size, and the given context. In this paper, we propose an n-gram-based reinforcement approach, which alleviates the above problems by the step-by-step application of stratified segmentation patterns. In our approach, various n-gram features, for example, unigram, bigram, and trigram features, are extracted from the training corpus and their frequencies are recorded. In the first step, relatively definite segmentations are determined by applying n-gram statistics with tight threshold values. The remaining tags are decided by applying more specific features, taking the previously determined space-tags into account. In experiments on Korean sentences, our method achieved much better performance than an existing bigram-based model. The proposed approach also performed well on Chinese word segmentation, confirming its language-independent effectiveness on Far East Asian languages.
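The two-step tagging procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses only a single character-bigram context (the paper combines unigram, bigram, and trigram features and conditions the second step on previously assigned tags), and the function names, the `tight` threshold value, and the majority-vote fallback are all assumptions made for the sketch.

```python
from collections import defaultdict

def train(sentences):
    """Count space-tag frequencies for character-bigram contexts.

    Each training sentence contains spaces; for every adjacent
    character pair (left, right) we record how often a space occurs
    between them versus not.  (A single bigram context keeps the
    sketch short; the paper uses richer stratified n-gram features.)
    """
    counts = defaultdict(lambda: [0, 0])  # (left, right) -> [no-space, space]
    for sent in sentences:
        chars, tags = [], []
        prev_space = False
        for ch in sent:
            if ch == ' ':
                prev_space = True
                continue
            if chars:
                tags.append(1 if prev_space else 0)
            chars.append(ch)
            prev_space = False
        for i, tag in enumerate(tags):
            counts[(chars[i], chars[i + 1])][tag] += 1
    return counts

def segment(text, counts, tight=0.9):
    """Two-step stratified tagging.

    Step 1 fixes only the boundaries whose estimated space
    probability is at least `tight` (or at most 1 - tight);
    step 2 fills the remaining tags by simple majority (where the
    paper instead applies more specific features that also consider
    the tags already decided in step 1).
    """
    chars = [c for c in text if c != ' ']
    tags = [None] * (len(chars) - 1)
    # Step 1: confident decisions only, using a tight threshold.
    for i in range(len(tags)):
        no, yes = counts.get((chars[i], chars[i + 1]), (0, 0))
        total = no + yes
        if total:
            p = yes / total
            if p >= tight:
                tags[i] = 1
            elif p <= 1 - tight:
                tags[i] = 0
    # Step 2: decide the remaining tags (majority vote fallback).
    for i in range(len(tags)):
        if tags[i] is None:
            no, yes = counts.get((chars[i], chars[i + 1]), (0, 0))
            tags[i] = 1 if yes > no else 0
    # Re-insert spaces according to the assigned tags.
    out = []
    for i, ch in enumerate(chars):
        out.append(ch)
        if i < len(tags) and tags[i] == 1:
            out.append(' ')
    return ''.join(out)
```

On a toy corpus, `segment("abcd", train(["ab cd", "ab cd"]))` recovers `"ab cd"`; the same skeleton applies unchanged to Korean or Chinese text, since it operates purely on character statistics.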