Chinese word segmentation as morpheme-based lexical chunking

Authors:
Guohong Fu; Chunyu Kit; Jonathan J. Webster
Affiliations:
School of Computer Science and Technology, Heilongjiang University, Harbin 150080, PR China and Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Chee Avenue ...; Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong SAR, PR China; Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong SAR, PR China
Venue:
Information Sciences: an International Journal
Year:
2008

Abstract

Chinese word segmentation plays an important role in many Chinese language processing tasks such as information retrieval and text mining. Recent research in Chinese word segmentation focuses on tagging approaches with either characters or words as tagging units. In this paper we present a morpheme-based chunking approach and implement it in a two-stage system. The system consists of two main components: a morpheme segmentation component, which segments an input sentence into a sequence of morphemes using morpheme-formation models and bigram language models, and a lexical chunking component, which labels each segmented morpheme with its position in a word of a special type with the aid of lexicalized hidden Markov models. To support these tasks, we also develop a statistical technique for automatically compiling a morpheme dictionary from a segmented or tagged corpus. To evaluate the approach, we conduct a closed test and an open test on the 2005 SIGHAN Bakeoff data. Our system demonstrates state-of-the-art performance on different test sets, showing the benefits of choosing morphemes as tagging units. Furthermore, the open test results indicate a significant performance gain from lexicalization and part-of-speech features.
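
The second stage described in the abstract labels each morpheme with its position in a word. The following minimal Python sketch illustrates only how such position labels determine word boundaries once they have been assigned. The tag names (B, I, E, S), the function name chunk_to_words, and the example sentence are assumptions made for illustration; the abstract does not specify the paper's actual tag set, which also encodes word type, and no HMM decoding is shown here.

```python
from typing import List, Tuple

# Hypothetical position tags (not taken from the paper):
#   B = word-initial morpheme, I = word-internal morpheme,
#   E = word-final morpheme,   S = morpheme that forms a word alone.


def chunk_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Join position-tagged morphemes into words."""
    words: List[str] = []
    buffer: List[str] = []
    for morpheme, tag in tagged:
        if tag in ("B", "S") and buffer:
            # A new word starts, so flush any unfinished word first.
            words.append("".join(buffer))
            buffer = []
        buffer.append(morpheme)
        if tag in ("E", "S"):
            # The current word is complete.
            words.append("".join(buffer))
            buffer = []
    if buffer:
        # Flush a trailing word that never received an E tag.
        words.append("".join(buffer))
    return words


# Illustrative input: morphemes as a first stage might produce them,
# paired with position tags as a chunker might assign them.
tagged_morphemes = [
    ("研究", "B"), ("生", "E"),
    ("学习", "S"),
    ("计算机", "B"), ("科学", "E"),
]
print(chunk_to_words(tagged_morphemes))
```

The sketch tolerates inconsistent tag sequences, such as a B followed directly by another B, by closing the unfinished word rather than raising an error. This is one possible design choice and is not claimed to be the authors' behavior.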