A hybrid approach to automatic word-spacing in Korean

Authors:
Mi-young Kang; Sung-woo Choi; Hyuk-chul Kwon
Affiliations:
Korean Language Processing Lab, School of Electrical & Computer Engineering, Pusan National University, Busan, Korea (all authors)
Venue:
IEA/AIE'2004 Proceedings of the 17th international conference on Innovations in applied artificial intelligence
Year:
2004

Citing 0
Cited 2

Combining rule-based learning and memory-based learning for automatic word spacing in simple message service
Applied Soft Computing

Combined word-spacing method for disambiguating Korean texts
AI'04 Proceedings of the 17th Australian joint conference on Advances in Artificial Intelligence

Abstract

This paper proposes a hybrid automatic word-spacing system for Korean that combines stochastic and knowledge-based approaches. The system determines the optimal splitting points of an input sentence using two simple parameters: (a) relative word frequency and (b) syllable n-gram statistics, both extracted from large processed corpora containing 33,643,884 word tokens. This method efficiently handles data noise, by training on processed data, and data sparseness, by using syllable n-gram statistics and large corpora. It still leaves the problem of unseen words, which can hardly be overcome even with a huge corpus. The study therefore supplements the stochastic approach by (a) dynamically expanding candidate words through longest-radix selection among possible morphemes and (b) treating major and minor lexical categories differently. The combined model remedies the drawbacks of the stochastic word-spacing algorithm and shows encouraging results: it achieved 97.51% precision in word-unit correction on external test data.
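
To make the stochastic part of the abstract concrete, the following is a minimal sketch of how relative word frequency and syllable n-gram statistics could be combined to choose splitting points. It is not the authors' system. The function name `segment`, the additive log-score, the per-syllable `unseen_penalty`, the default boundary probability of 0.5, and the toy counts are all assumptions made for illustration. The knowledge-based steps (longest-radix candidate expansion and the separate handling of major and minor lexical categories) are not modeled here; unseen strings simply receive a flat penalty.

```python
import math


def segment(text, word_freq, total_tokens, space_bigram=None, unseen_penalty=-20.0):
    """Insert spaces into `text` (a string with no spaces) by dynamic programming.

    word_freq     : dict mapping a word to its corpus count (toy data here)
    total_tokens  : total number of word tokens in the corpus
    space_bigram  : dict mapping a two-syllable string "ab" to an assumed
                    probability that a space occurs between a and b
    unseen_penalty: assumed log-score per syllable for words not in word_freq

    Assumes precomposed (NFC) Hangul, so one syllable is one character.
    """
    space_bigram = space_bigram or {}
    n = len(text)
    # best[i] = (best score of text[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)

    for end in range(1, n + 1):
        for start in range(end):
            prev_score = best[start][0]
            if prev_score == -math.inf:
                continue
            word = text[start:end]
            count = word_freq.get(word, 0)
            if count:
                # (a) relative word frequency
                score = math.log(count / total_tokens)
            else:
                # flat penalty for unseen words (illustrative assumption)
                score = unseen_penalty * len(word)
            if start > 0:
                # (b) syllable bigram evidence for a space at this boundary
                p_space = space_bigram.get(text[start - 1:start + 1], 0.5)
                score += math.log(p_space)
            if prev_score + score > best[end][0]:
                best[end] = (prev_score + score, start)

    # Recover the words by walking the back-pointers from the end.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return " ".join(reversed(words))


# Toy statistics, invented for this example only.
word_freq = {"나는": 50, "학교에": 30, "간다": 20, "나": 5, "는": 5, "학교": 10, "에": 8}
space_bigram = {"는학": 0.9, "에간": 0.9}

result = segment("나는학교에간다", word_freq, 1000, space_bigram)
print(result)  # 나는 학교에 간다
```

With these toy counts the three-word split wins because each extra boundary adds another negative log-frequency term. A real system would estimate both tables from a large corpus and would need a better treatment of unseen words than a flat penalty, which is the gap the knowledge-based component described above is meant to fill.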