Korean stochastic word-spacing with dynamic expansion of candidate words list

  • Authors:
  • Mi-young Kang;Sung-ja Choi;Ae-sun Yoon;Hyuk-chul Kwon

  • Affiliations:
  • Korean Language Processing Lab, School of Electrical & Computer Engineering, Pusan National University, Busan, Korea (all four authors)

  • Venue:
  • IJCNLP'04 Proceedings of the First International Joint Conference on Natural Language Processing
  • Year:
  • 2004

Abstract

The main aim of this work is to implement a stochastic Korean word-spacing system that is equally robust for both inner data and external data. Word spacing in Korean is influential in deciding semantic and syntactic scope. To cope with the various problems caused by word-spacing errors when processing Korean text, this study (a) presents a simple stochastic word-spacing system with only two parameters, using relative word-unigram frequencies and the odds favoring the inner-spacing probability of disyllables located at the boundary of stochastic-based words; (b) endeavors to reduce training-data dependency by dynamically creating a candidate-word list with the longest-radix-selecting algorithm; and (c) removes noise from the training data by refining the training procedure. The system thus becomes robust against unseen words and offers similar performance on both inner data and external data: it obtained 98.35% and 97.47% precision in word-unit correction on the inner test data and the external test data, respectively.
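
The idea in point (a) can be illustrated with a small sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: the function name `segment`, the toy probability tables, the `unseen_prob` fallback, and the way the two terms are combined (sum of log unigram probability and log boundary odds, maximized by dynamic programming). It does not attempt the longest-radix-selecting candidate expansion of point (b) or the training refinement of point (c).

```python
import math

def segment(text, word_prob, boundary_odds, unseen_prob=1e-8, max_len=5):
    """Return the highest-scoring spacing of an unspaced string.

    Hypothetical scoring (not taken from the paper): each candidate word
    contributes log P(word); each space contributes the log odds that a
    space falls inside the disyllable straddling that boundary.
    """
    n = len(text)
    # best[i] = (best log-score for text[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            score = best[start][0] + math.log(word_prob.get(word, unseen_prob))
            if start > 0:
                # Disyllable formed by the last syllable of the previous
                # word and the first syllable of this one.
                pair = text[start - 1:start + 1]
                score += math.log(boundary_odds.get(pair, 1.0))
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Toy tables (invented values, for illustration only).
word_prob = {"아버지": 0.03, "아버지가": 0.02, "가방에": 0.02,
             "방에": 0.02, "들어가신다": 0.01}
text = "아버지가방에들어가신다"

unigram_only = segment(text, word_prob, boundary_odds={})
with_odds = segment(text, word_prob, boundary_odds={"지가": 0.2})
print(" ".join(unigram_only))
print(" ".join(with_odds))
```

With these invented numbers, unigram frequencies alone prefer "아버지 가방에 들어가신다", while a low odds value for a space inside the disyllable "지가" shifts the result to "아버지가 방에 들어가신다". This is only meant to show how a boundary term can override a frequency-based choice, not to reproduce the reported figures.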