Chinese text segmentation for text retrieval: achievements and problems
Journal of the American Society for Information Science
Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval
SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Chinese text retrieval without using a dictionary
Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Discovering Chinese words from unsegmented text (poster abstract)
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Using self-supervised word segmentation in Chinese information retrieval
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Self-Supervised Chinese Word Segmentation
IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
A compression-based algorithm for Chinese word segmentation
Computational Linguistics
Chinese text segmentation with MBDP-1: making the most of training corpora
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Text classification in Asian languages without word segmentation
AsianIR '03 Proceedings of the sixth international workshop on Information retrieval with Asian languages - Volume 11
Information retrieval oriented word segmentation based on character associative strength ranking
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Optimizing Chinese word segmentation for machine translation performance
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Experimental study on representing units in Chinese text categorization
CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing
A hybrid Chinese information retrieval model
AMT'10 Proceedings of the 6th international conference on Active media technology
A logistic regression-based smoothing method for Chinese text categorization
Expert Systems with Applications: An International Journal
Hi-index | 0.00 |
It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.