Applying Machine Learning to Text Segmentation for Information Retrieval

Authors:
Xiangji Huang; Fuchun Peng; Dale Schuurmans; Nick Cercone; Stephen E. Robertson
Affiliations:
School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. jhuang@ai.uwaterloo.ca; School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. f3peng@ai.uwaterloo.ca; School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. dale@ai.uwaterloo.ca; School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. ncercone@ai.uwaterloo.ca; Microsoft Research Ltd., Cambridge, UK and City University, London, UK. ser@microsoft.com
Venue:
Information Retrieval
Year:
2003

Abstract

We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. The method combines the advantages of traditional dictionary-based, character-based and mutual-information-based approaches while overcoming many of their shortcomings. Experiments on TREC data show that the method is promising. Because it is completely language independent and unsupervised, it provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as that of some other segmentation methods, it is sufficient to give good retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. We find, however, that for Chinese the relationship between segmentation accuracy and retrieval performance is in fact nonmonotonic: at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur, which leads to a reduction in information retrieval performance. We demonstrate this effect through an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including the 70% accuracy achieved by our self-supervised word segmentation approach. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that different evaluation standards for word segmentation should be applied to different applications.
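
The abstract reports word segmentation accuracies without saying how they are computed. As an illustration only, the sketch below scores a predicted segmentation against a reference one using word-level precision, recall and F1 over character spans. This is a common convention and an assumption here, not necessarily the measure used in the paper. The function names and the toy sentence are hypothetical.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character offsets they cover."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(predicted, gold):
    """Return word-level (precision, recall, F1) for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the reference segmentation (assumed convention).
    """
    if "".join(predicted) != "".join(gold):
        raise ValueError("both segmentations must cover the same characters")
    pred_spans = word_spans(predicted)
    gold_spans = word_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example (hypothetical reference segmentation of "machine learning methods").
gold = ["机器", "学习", "方法"]
keeps_compound = ["机器学习", "方法"]   # leaves the compound unsplit
splits_more = ["机", "器", "学习", "方法"]  # breaks the first word into characters

print(segmentation_scores(keeps_compound, gold))
print(segmentation_scores(splits_more, gold))
```

Both toy segmentations are imperfect under this measure, yet they differ in whether they keep or split multi-character units. A single accuracy figure does not capture that distinction, which may matter for retrieval, in line with the abstract's point that evaluation standards should depend on the application.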