Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification

  • Authors:
  • Chu-Ren Huang;Petr Šimon;Shu-Kai Hsieh;Laurent Prévot

  • Affiliations:
  • Institute of Linguistics, Taiwan;Institute of Linguistics, Taiwan;DoFLAL, NIU, Taiwan;Université de Toulouse, France

  • Venue:
  • ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
  • Year:
  • 2007

Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CB's) into either word-boundaries (WB's) or non-word-boundaries. In Chinese, each CB falls between two characters. Hence we can use the distributional properties of CB's within the surrounding character strings to predict which CB's are WB's.
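The sketch below illustrates the CB-classification view described in the abstract: every gap between two characters is scored from raw-text distributional statistics and labelled WB or non-WB. The specific statistic used here (forward branching entropy of the preceding character) and the threshold value are illustrative assumptions on my part; the abstract does not commit to a particular feature or classifier.

```python
# Minimal sketch: segmentation as binary classification of character
# boundaries (CB's) into word boundaries (WB's) vs. non-WB's, using only
# distributional statistics from unsegmented text. The branching-entropy
# feature and the threshold are assumptions, not the authors' exact method.

import math
from collections import Counter, defaultdict

def train_successor_counts(corpus):
    """For each character, count the distribution of characters that follow it."""
    following = defaultdict(Counter)
    for sentence in corpus:
        for left, right in zip(sentence, sentence[1:]):
            following[left][right] += 1
    return following

def branching_entropy(following, ch):
    """Entropy of the successor distribution of `ch`. A high value means many
    different characters can follow, which suggests a likely word boundary."""
    counts = following[ch]
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def segment(sentence, following, threshold=1.0):
    """Classify each CB: insert a WB wherever the distributional score
    of the preceding character exceeds the threshold."""
    words, current = [], sentence[0]
    for ch in sentence[1:]:
        if branching_entropy(following, current[-1]) > threshold:
            words.append(current)   # CB classified as WB
            current = ch
        else:
            current += ch           # CB classified as non-WB
    words.append(current)
    return words
```

Because the classifier touches only character-level co-occurrence counts, it needs no lexicon and no annotated training data, which is the property the abstract highlights for both the HLT and the cognitive-modelling settings.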