A language independent n-gram model for word segmentation

  • Authors:
  • Seung-Shik Kang; Kyu-Baek Hwang

  • Affiliations:
  • Department of Computer Science, Kookmin University, Seoul, Korea; School of Computing, Soongsil University, Seoul, Korea

  • Venue:
  • AI'06: Proceedings of the 19th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
  • Year:
  • 2006

Abstract

Word segmentation is an essential first step in the processing of Far East Asian languages (i.e., Chinese, Japanese, and Korean), and it heavily influences subsequent processes such as morphological analysis and parsing. One popular method for this task is to learn segmentation patterns, e.g., n-gram features, from corpus data with space-tags attached. However, it is not straightforward to learn reliable patterns, because typical datasets are sparse. Moreover, the coverage and accuracy of learned patterns vary with many factors, such as the value of n, the dataset size, and the given context. In this paper, we propose an n-gram-based reinforcement approach, which alleviates the above problems by the step-by-step application of stratified segmentation patterns. In our approach, various n-gram features, for example, unigram, bigram, and trigram features, are extracted from the training corpus and their frequencies are recorded. In the first step, relatively definite segmentations are determined by applying n-gram statistics with tight threshold values. The remaining tags are decided by applying more specific features, taking the previously determined space-tags into account. In experiments on Korean sentences, our method achieved much better performance than an existing bigram-based model. The proposed approach also performed well on Chinese word segmentation, confirming its language-independent effectiveness on Far East Asian languages.
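The two-step tagging procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses only a single character-bigram context (the paper combines unigram, bigram, and trigram features and conditions the second step on previously assigned tags), and the function names, the `tight` threshold value, and the majority-vote fallback are all assumptions made for the sketch.

```python
from collections import defaultdict

def train(sentences):
    """Count space-tag frequencies for character-bigram contexts.

    Each training sentence contains spaces; for every adjacent
    character pair (left, right) we record how often a space occurs
    between them versus not.  (A single bigram context keeps the
    sketch short; the paper uses richer stratified n-gram features.)
    """
    counts = defaultdict(lambda: [0, 0])  # (left, right) -> [no-space, space]
    for sent in sentences:
        chars, tags = [], []
        prev_space = False
        for ch in sent:
            if ch == ' ':
                prev_space = True
                continue
            if chars:
                tags.append(1 if prev_space else 0)
            chars.append(ch)
            prev_space = False
        for i, tag in enumerate(tags):
            counts[(chars[i], chars[i + 1])][tag] += 1
    return counts

def segment(text, counts, tight=0.9):
    """Two-step stratified tagging.

    Step 1 fixes only the boundaries whose estimated space
    probability is at least `tight` (or at most 1 - tight);
    step 2 fills the remaining tags by simple majority (where the
    paper instead applies more specific features that also consider
    the tags already decided in step 1).
    """
    chars = [c for c in text if c != ' ']
    tags = [None] * (len(chars) - 1)
    # Step 1: confident decisions only, using a tight threshold.
    for i in range(len(tags)):
        no, yes = counts.get((chars[i], chars[i + 1]), (0, 0))
        total = no + yes
        if total:
            p = yes / total
            if p >= tight:
                tags[i] = 1
            elif p <= 1 - tight:
                tags[i] = 0
    # Step 2: decide the remaining tags (majority vote fallback).
    for i in range(len(tags)):
        if tags[i] is None:
            no, yes = counts.get((chars[i], chars[i + 1]), (0, 0))
            tags[i] = 1 if yes > no else 0
    # Re-insert spaces according to the assigned tags.
    out = []
    for i, ch in enumerate(chars):
        out.append(ch)
        if i < len(tags) and tags[i] == 1:
            out.append(' ')
    return ''.join(out)
```

On a toy corpus, `segment("abcd", train(["ab cd", "ab cd"]))` recovers `"ab cd"`; the same skeleton applies unchanged to Korean or Chinese text, since it operates purely on character statistics.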