A novel word segmentation approach for written languages with word boundary markers

  • Authors and affiliations:
  • Han-Cheol Cho (The University of Tokyo, Tokyo, Japan); Do-Gil Lee (Korea University, Seoul, Korea); Jung-Tae Lee (Korea University, Seoul, Korea); Pontus Stenetorp (The University of Tokyo, Tokyo, Japan); Jun'ichi Tsujii (The University of Tokyo, Tokyo, Japan); Hae-Chang Rim (Korea University, Seoul, Korea)

  • Venue:
  • ACLShort '09 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
  • Year:
  • 2009

Abstract

Most NLP applications work under the assumption that user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as a trivial issue. However, noisy real-world texts, such as blogs, e-mails, and SMS, may contain spacing errors that must be corrected before further processing can take place. For the Korean language, many researchers have adopted a traditional WS approach, which eliminates all spaces in the user input and re-inserts proper word boundaries. Unfortunately, because no perfect WS model exists, such an approach often degrades the word spacing quality of input that has few or no spacing errors. In this paper, we propose a novel WS method that takes the initial word spacing information of the user input into consideration. Our method generates better output than the original user input, even when the input contains few spacing errors. Moreover, the proposed method significantly outperforms a state-of-the-art Korean WS model when the user input initially contains less than 10% spacing errors, and performs comparably when it contains more. We believe that the proposed method will serve as a practical pre-processing module.
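
To make the contrast described in the abstract concrete, the sketch below is a minimal Python illustration, not the authors' implementation: it compares a traditional pipeline, which discards all original spaces before re-inserting boundaries, with a spacing-aware variant that keeps the user's spaces as a per-character feature. The names `tagger`, `space_features`, and `trivial_tagger` are illustrative assumptions; a real system would plug a statistical model (e.g. a CRF) into the tagger slot.

```python
from typing import List, Optional


def traditional_ws(text: str, tagger) -> str:
    """Remove all spaces, then re-insert the boundaries the tagger predicts."""
    chars = [c for c in text if c != " "]
    labels = tagger(chars, space_features=None)  # the tagger sees no original spacing
    return "".join(
        c + (" " if label == "B" else "") for c, label in zip(chars, labels)
    ).rstrip()


def spacing_aware_ws(text: str, tagger) -> str:
    """Keep the user's original spacing as an extra per-character feature."""
    chars: List[str] = []
    had_space_after: List[bool] = []
    for i, c in enumerate(text):
        if c == " ":
            continue
        chars.append(c)
        had_space_after.append(i + 1 < len(text) and text[i + 1] == " ")
    labels = tagger(chars, space_features=had_space_after)
    return "".join(
        c + (" " if label == "B" else "") for c, label in zip(chars, labels)
    ).rstrip()


def trivial_tagger(chars: List[str], space_features: Optional[List[bool]]) -> List[str]:
    """Placeholder model: trusts the original spacing whenever it is available.

    "B" means "insert a space after this character", "I" means "do not".
    """
    if space_features is None:
        return ["I"] * len(chars)  # no spacing information: predict no boundaries
    return ["B" if s else "I" for s in space_features]


if __name__ == "__main__":
    noisy = "아버지가 방에들어가신다"  # one space is missing in the user input
    print(traditional_ws(noisy, trivial_tagger))    # all original spacing is lost
    print(spacing_aware_ws(noisy, trivial_tagger))  # original spacing is preserved
```

The point of the toy comparison: when the model is imperfect, the traditional pipeline can only lose whatever correct spacing the user already provided, whereas a spacing-aware model can fall back on it, which is the intuition behind the paper's claim for inputs with few spacing errors.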