A hybrid approach to automatic word-spacing in Korean

  • Authors:
  • Mi-young Kang; Sung-woo Choi; Hyuk-chul Kwon

  • Affiliations:
  • Korean Language Processing Lab, School of Electrical & Computer Engineering, Pusan National University, Busan, Korea; Korean Language Processing Lab, School of Electrical & Computer Engineering, Pusan National University, Busan, Korea; Korean Language Processing Lab, School of Electrical & Computer Engineering, Pusan National University, Busan, Korea

  • Venue:
  • IEA/AIE'2004: Proceedings of the 17th International Conference on Innovations in Applied Artificial Intelligence
  • Year:
  • 2004

Abstract

This paper proposes a hybrid automatic word-spacing system for Korean that combines stochastic and knowledge-based approaches. The system determines the optimal splitting points of an input sentence using two simple parameters: (a) relative word frequency and (b) syllable n-gram statistics, both extracted from large processed corpora containing 33,643,884 word tokens. This method efficiently handles possible data noise, by using processed training data, and data sparseness, by using syllable n-gram statistics and large corpora. Unseen words remain a problem, however, and one that can hardly be overcome even with a huge corpus. The study therefore compensates for the weaknesses of the stochastic approach by (a) dynamically expanding candidate words through longest-radix selection among possible morphemes and (b) treating major and minor lexical categories differently. The combined model remedies drawbacks of the stochastic word-spacing algorithm and shows encouraging results: it obtained 97.51% precision in word-unit correction on external test data.
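
To make the idea of choosing splitting points by relative word frequency concrete, here is a minimal Python sketch. It is not the authors' system: the function name `segment`, the toy counts, the `max_len` limit, and the per-syllable `unseen_logp` penalty are all assumptions made for illustration, and the syllable n-gram statistics and the knowledge-based components described in the abstract are left out. The sketch uses dynamic programming to find the split of an unspaced string that maximizes the sum of log relative word frequencies.

```python
import math


def segment(text, word_counts, max_len=8, unseen_logp=-20.0):
    """Split `text` into the word sequence with the highest total log score.

    Known words score log(count / total). Unknown chunks receive a crude
    per-syllable penalty (`unseen_logp`), an assumption of this sketch that
    only keeps every position reachable.
    """
    total = sum(word_counts.values())
    n = len(text)
    # best[i] = (best score for text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            chunk = text[start:end]
            if chunk in word_counts:
                logp = math.log(word_counts[chunk] / total)
            else:
                logp = unseen_logp * len(chunk)
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the string to recover the words.
    words, i = [], n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]


# Toy counts invented for this example; they are not taken from the paper.
toy_counts = {"아버지가": 5, "방에": 8, "들어가신다": 3, "아버지": 6, "가방에": 1}
print(" ".join(segment("아버지가방에들어가신다", toy_counts)))
# prints: 아버지가 방에 들어가신다
```

With these toy counts, the reading "아버지가 방에 들어가신다" outscores "아버지 가방에 들어가신다" because the product of its relative frequencies is larger. The sketch assumes precomposed Hangul syllables, so that each syllable is a single character. The flat penalty for unknown chunks is exactly the weak point the abstract identifies for purely stochastic methods, which the paper addresses with candidate-word expansion and category-sensitive treatment.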