Applying Machine Learning to Text Segmentation for Information Retrieval

  • Authors:
  • Xiangji Huang; Fuchun Peng; Dale Schuurmans; Nick Cercone; Stephen E. Robertson

  • Affiliations:
  • School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. jhuang@ai.uwaterloo.ca; School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. f3peng@ai.uwaterloo.ca; School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. dale@ai.uwaterloo.ca; School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. ncercone@ai.uwaterloo.ca; Microsoft Research Ltd., Cambridge, UK and City University, London, UK. ser@microsoft.com

  • Venue:
  • Information Retrieval
  • Year:
  • 2003

Abstract

We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. The method combines the advantages of traditional dictionary-based, character-based, and mutual-information-based approaches while overcoming many of their shortcomings. Experiments on TREC data show that the method is promising. It is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as that of some other segmentation methods, it is sufficient to give good retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. We find, however, that the relationship is in fact nonmonotonic: at around 70% word segmentation accuracy, an over-segmentation phenomenon begins to occur, which leads to a reduction in information retrieval performance. We demonstrate this effect through an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including the 70% accuracy achieved by our self-supervised word segmentation approach. The main reason for the drop in retrieval performance appears to be that accurate segmenters preserve correct compounds and collocations, whereas less accurate (but reasonable) segmenters break them up, to a surprising advantage for retrieval. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that word segmentation should be evaluated by different standards for different applications.
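
The abstract compares segmenters by their word segmentation accuracy but does not say how that accuracy is computed. As a rough illustration only, the sketch below assumes a common convention: a predicted word counts as correct when its character span exactly matches a word in a reference segmentation, and precision, recall, and F1 are computed over those spans. The function names `word_spans` and `segmentation_scores` and the example sentence are hypothetical and are not taken from the paper.

```python
def word_spans(words):
    """Convert a segmentation (list of words) into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(gold, predicted):
    """Return (precision, recall, f1) of a predicted segmentation against a reference.

    Assumption: a predicted word is correct only if both of its boundaries
    match a word in the reference segmentation of the same text.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("both segmentations must cover the same text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the reference keeps the compound "检索" whole,
# while the predicted segmentation splits it into single characters.
gold = ["信息", "检索", "系统"]
predicted = ["信息", "检", "索", "系统"]
precision, recall, f1 = segmentation_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the predicted segmentation scores lower because it splits a compound, which is the kind of behaviour the abstract reports as sometimes helping retrieval. A segmentation score of this sort therefore need not track retrieval quality.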