We propose a self-supervised word segmentation technique for Chinese information retrieval. The method combines the advantages of traditional dictionary-based, character-based, and mutual-information-based approaches while overcoming many of their shortcomings. Experiments on TREC data show that it is promising. The method is completely language-independent and unsupervised, which offers a promising avenue for building accurate multilingual or cross-lingual information retrieval systems that are flexible and adaptive. Although the segmentation accuracy of self-supervised segmentation is lower than that of some other segmentation methods, we find that it is sufficient for good retrieval performance. It is commonly believed that retrieval performance in Chinese information retrieval increases monotonically with word segmentation accuracy. We find instead that the relationship is nonmonotonic: at around 70% word segmentation accuracy, an over-segmentation phenomenon begins to occur that reduces retrieval performance. We demonstrate this effect with an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms whose accuracies range from 44% to 95%, including the 70% accuracy of our self-supervised approach. The main reason for the drop in retrieval performance appears to be that accurate segmenters preserve correct compounds and collocations, whereas less accurate (but reasonable) segmenters break them up, to a surprising advantage for retrieval. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that word segmentation should be evaluated by different standards for different applications.
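As a rough illustration of the mutual-information-based family of approaches mentioned above, the following minimal Python sketch joins adjacent characters whose pointwise mutual information exceeds a threshold. It is not the self-supervised method proposed here; the function names (`train_counts`, `pmi`, `segment`), the toy corpus, and the threshold value are all assumptions made for the example.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in unsegmented lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information (base 2) of the adjacent pair a, b.

    Returns -inf for a pair never seen in training, so it is never joined.
    """
    pair = bigrams[a + b]
    if pair == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = pair / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Greedy segmentation: extend the current word while PMI stays above threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    if not text:
        return []
    words = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if pmi(prev, cur, unigrams, bigrams) > threshold:
            words[-1] += cur      # strongly associated: keep in the same word
        else:
            words.append(cur)     # weakly associated: start a new word
    return words


# Toy usage with Latin letters standing in for characters.
unigrams, bigrams = train_counts(["abcd", "abef", "cdef"])
print(segment("abcdef", unigrams, bigrams, threshold=2.5))
```

On this toy corpus the pairs `ab`, `cd`, and `ef` have a PMI of about 3 bits and the pairs `bc` and `de` about 2 bits, so a threshold of 2.5 yields `['ab', 'cd', 'ef']`. Raising the threshold produces shorter units and lowering it produces longer ones, which is one simple way to explore the trade-off between segmentation granularity and retrieval performance discussed above.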