Combined word-spacing method for disambiguating Korean texts
AI'04 Proceedings of the 17th Australian joint conference on Advances in Artificial Intelligence
This paper proposes a hybrid automatic word-spacing system for Korean that combines stochastic and knowledge-based approaches. The system determines the optimal splitting points of an input sentence using two simple parameters: (a) relative word frequency and (b) syllable n-gram statistics, both extracted from large processed corpora containing 33,643,884 word tokens. This method handles possible data noise by using processed training data, and data sparseness by using syllable n-gram statistics and large corpora. Unseen words remain a problem, however, and can hardly be overcome even with a huge corpus. The study therefore supplements the stochastic approach by (a) dynamically expanding candidate words through longest-radix selection among possible morphemes and (b) treating major and minor lexical categories differently. The combined model remedies drawbacks of the stochastic word-spacing algorithm and shows encouraging results: it obtained 97.51% precision in word-unit correction on external test data.
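The stochastic part of the approach, choosing splitting points from relative word frequency and syllable n-gram statistics, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`train`, `space_prob`, `segment`), the use of syllable bigrams, the add-one smoothing, and the per-syllable penalty for unseen words are all assumptions made for illustration. The knowledge-based steps (longest-radix candidate expansion and category-dependent treatment) are not modeled here.

```python
import math
from collections import Counter


def train(spaced_sentences):
    """Collect word counts and syllable-bigram spacing counts from correctly spaced text."""
    word_counts = Counter()
    gap_counts = Counter()    # (left syllable, right syllable) -> times seen adjacent
    space_counts = Counter()  # same key -> times a space separated the pair
    for sentence in spaced_sentences:
        words = sentence.split()
        word_counts.update(words)
        for word in words:
            for a, b in zip(word, word[1:]):
                gap_counts[(a, b)] += 1
        for left, right in zip(words, words[1:]):
            key = (left[-1], right[0])
            gap_counts[key] += 1
            space_counts[key] += 1
    return word_counts, gap_counts, space_counts


def space_prob(a, b, gap_counts, space_counts):
    """Smoothed probability of a space between syllables a and b (add-one, two outcomes)."""
    return (space_counts[(a, b)] + 1) / (gap_counts[(a, b)] + 2)


def segment(text, model, max_len=8):
    """Insert spaces into unspaced text by dynamic programming over splitting points."""
    word_counts, gap_counts, space_counts = model
    total = sum(word_counts.values())
    # Hypothetical per-syllable penalty for words never seen in training.
    unseen = math.log(1.0 / (total + 1))
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that solution
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            count = word_counts[word]
            # (a) relative word frequency
            score = math.log(count / total) if count else unseen * len(word)
            # (b) syllable bigram statistics: no space inside the candidate word ...
            for a, b in zip(word, word[1:]):
                score += math.log(1.0 - space_prob(a, b, gap_counts, space_counts))
            # ... and a space right after it, unless it ends the sentence
            if end < n:
                score += math.log(
                    space_prob(text[end - 1], text[end], gap_counts, space_counts)
                )
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return " ".join(reversed(words))
```

Trained on the three toy sentences "나는 학교에 간다", "나는 집에 간다", and "학교에 간다", calling `segment("나는학교에간다", model)` returns "나는 학교에 간다". With a corpus this small, any word absent from training is heavily penalized, which is the unseen-word weakness the abstract attributes to purely stochastic spacing.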