Information retrieval oriented word segmentation based on character associative strength ranking

  • Authors:
  • Yixuan Liu;Bin Wang;Fan Ding;Sheng Xu

  • Affiliations:
  • Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China

  • Venue:
  • EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper presents a novel, ranking-style word segmentation approach, called RSVM-Seg, which is well tailored to Chinese information retrieval(CIR). This strategy makes segmentation decision based on the ranking of the internal associative strength between each pair of adjacent characters of the sentence. On the training corpus composed of query items, a ranking model is learned by a widely-used tool Ranking SVM, with some useful statistical features, such as mutual information, difference of t-test, frequency and dictionary information. Experimental results show that, this method is able to eliminate overlapping ambiguity much more effectively, compared to the current word segmentation methods. Furthermore, as this strategy naturally generates segmentation results with different granularity, the performance of CIR systems is improved and achieves the state of the art.