Chinese text segmentation: A hybrid approach using transductive learning and statistical association measures

Authors:
Richard Tzong-Han Tsai
Affiliations:
Department of Computer Science and Engineering, Yuan Ze University, Taiwan
Venue:
Expert Systems with Applications: An International Journal
Year:
2010

Abstract

Chinese text segmentation (CTS) is a fundamental step in building any Chinese or cross-language information retrieval system. This paper identifies two main challenges facing today's CTS systems and proposes solutions to them: segmenting words longer than the context window, and identifying words not derived from affixation or composition. Our methods exploit unlabeled data, making them scalable at little extra cost. To tackle the first problem, we use a transductive learning approach to automatically construct a dictionary, and then refine it by improving its test set coverage while reducing its tendency to over-fit. In addition, we incorporate frequency information to discriminate between overlapping matching words. For the second problem, we employ statistical association measures non-parametrically through a natural but novel feature representation scheme. To demonstrate the generality of our approach, we verify our system on the most reputable CTS evaluation standard, the SIGHAN bakeoff, which contains datasets in both traditional and simplified Chinese. These datasets are provided by representative academic or industrial research institutes. The experimental results show that, using only training data and unlabeled test data and no external dictionaries, our approach effectively overcomes the above-mentioned problems and reduces segmentation errors by an average of 27.8% compared with the traditional approach. Notably, our approach improves the recall of new words, the most informative words, by 4.7% on average. Our approach also outperforms the best SIGHAN CTS system, which requires many external resources. Additional analysis shows that our approach has the potential to gain accuracy as the amount of test data increases.
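
The abstract does not specify which statistical association measures are used or how they are encoded as features. As a rough illustration only, the sketch below computes pointwise mutual information (PMI), one common association measure, for adjacent character pairs in unsegmented text, and maps each score to a discrete bucket. The function names, the choice of PMI, and the bucket thresholds are all assumptions made for this example, not the paper's actual scheme.

```python
import math
from collections import Counter


def pmi_table(corpus):
    """Compute PMI (base 2) for every adjacent character pair.

    `corpus` is an iterable of unsegmented strings. Probabilities are
    plain relative frequencies; no smoothing is applied (an assumption
    made to keep the sketch short).
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)                    # count single characters
        bigrams.update(zip(sentence, sentence[1:]))  # count adjacent pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def pmi_bucket(score, edges=(0.0, 2.0, 4.0)):
    """Map a PMI score to a bucket index.

    The thresholds in `edges` are hypothetical; a real system would
    choose them from held-out data.
    """
    return sum(score >= edge for edge in edges)


if __name__ == "__main__":
    table = pmi_table(["北京大学", "北京"])
    for pair, score in sorted(table.items(), key=lambda kv: -kv[1]):
        print("".join(pair), round(score, 3), pmi_bucket(score))
```

A high PMI suggests that two characters co-occur more often than chance would predict, which is weak evidence that they belong to the same word. Because the counts come from raw text, such statistics can be gathered from unlabeled data, in keeping with the abstract's emphasis on exploiting it.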