Korean stochastic word-spacing with dynamic expansion of candidate words list

Authors:
Mi-young Kang;Sung-ja Choi;Ae-sun Yoon;Hyuk-chul Kwon
Affiliations:
Korean Language Processing Lab, School of Electrical & Computer Engineering, Pusan National University, Busan, Korea (all authors)
Venue:
IJCNLP'04 Proceedings of the First International Joint Conference on Natural Language Processing
Year:
2004

Abstract

The main aim of this work is to implement a stochastic Korean word-spacing system that is equally robust for both inner data and external data. Word spacing in Korean is influential in deciding semantic and syntactic scope. To cope with the various problems caused by word-spacing errors in processing Korean text, this study (a) presents a simple stochastic word-spacing system with only two parameters, using relative word-unigram frequencies and the odds favoring the inner-spacing probability of disyllables located at the boundary of stochastic-based words; (b) endeavors to diminish dependency on the training data by dynamically creating a candidate-word list with the longest-radix-selecting algorithm; and (c) removes noise from the training data by refining the training procedure. The system thus becomes robust against unseen words and offers similar performance on both inner data and external data: it obtained 98.35% and 97.47% precision in word-unit correction on the inner test data and the external test data, respectively.
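To make the idea in (a) concrete, the following is a minimal sketch of how a two-parameter scorer of this general kind could be combined with dynamic programming. It is not the authors' implementation: the function name `segment`, the weights `alpha` and `beta`, the unknown-word floor `unk`, the way the two scores are combined, and the toy frequencies are all assumptions made for illustration. The dynamic candidate-list expansion in (b) and the training refinement in (c) are not covered.

```python
import math


def segment(text, unigram, space_odds, alpha=1.0, beta=1.0, max_len=6, unk=1e-8):
    """Split an unspaced string into words (hypothetical sketch).

    unigram    -- dict: word -> relative frequency (assumed precomputed)
    space_odds -- dict: two-syllable string straddling a boundary ->
                  odds that a space occurs between the two syllables
    alpha/beta -- the two weights assumed here for the two score terms
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best log-score of a split ending at i
    back = [0] * (n + 1)          # start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Unseen words get a small floor frequency instead of zero.
            score = best[start] + alpha * math.log(unigram.get(word, unk))
            if start > 0:
                # Disyllable formed by the syllables on each side of the boundary.
                pair = text[start - 1:start + 1]
                score += beta * math.log(space_odds.get(pair, 1.0))
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers to recover the word sequence.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy data, invented for this example only.
toy_unigram = {"나는": 0.3, "학교에": 0.2, "간다": 0.2,
               "나": 0.05, "는": 0.05, "학교": 0.1}
toy_space_odds = {"는학": 4.0, "에간": 3.0}

print(" ".join(segment("나는학교에간다", toy_unigram, toy_space_odds)))
```

With the toy data above, the highest-scoring split is the one that places spaces after 나는 and 학교에. In practice the frequencies and odds would be estimated from a training corpus, and the weights would be tuned on held-out data.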