Investigating the relationship between word segmentation performance and retrieval performance in Chinese IR

Authors:
Fuchun Peng;Xiangji Huang;Dale Schuurmans;Nick Cercone
Affiliations:
University of Waterloo, Waterloo, Ontario, Canada;University of Waterloo, Waterloo, Ontario, Canada;University of Waterloo, Waterloo, Ontario, Canada;University of Waterloo, Waterloo, Ontario, Canada
Venue:
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Year:
2002

Citing 9
Cited 6

Chinese text segmentation for text retrieval: achievements and problems

Journal of the American Society for Information Science
Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
On Chinese text retrieval

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Chinese text retrieval without using a dictionary

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Discovering Chinese words from unsegmented text (poster abstract)

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Using self-supervised word segmentation in Chinese information retrieval

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Self-Supervised Chinese Word Segmentation

IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
A compression-based algorithm for Chinese word segmentation

Computational Linguistics
Chinese text segmentation with MBDP-1: making the most of training corpora

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics

Text classification in Asian languages without word segmentation

AsianIR '03 Proceedings of the sixth international workshop on Information retrieval with Asian languages - Volume 11
Information retrieval oriented word segmentation based on character associative strength ranking

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Optimizing Chinese word segmentation for machine translation performance

StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Experimental study on representing units in Chinese text categorization

CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing
A hybrid Chinese information retrieval model

AMT'10 Proceedings of the 6th international conference on Active media technology
A logistic regression-based smoothing method for Chinese text categorization

Expert Systems with Applications: An International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.