Information retrieval oriented word segmentation based on character associative strength ranking

Authors:
Yixuan Liu;Bin Wang;Fan Ding;Sheng Xu
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China (all four authors)
Venue:
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2008

Citing 11
Cited 0

On the use of words and n-grams for Chinese information retrieval
IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages

Using self-supervised word segmentation in Chinese information retrieval
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Optimizing search engines using clickthrough data
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

A Chinese dictionary construction algorithm for information retrieval
ACM Transactions on Asian Language Information Processing (TALIP)

Chinese word segmentation and its effect on information retrieval
Information Processing and Management: an International Journal

Investigating the relationship between word segmentation performance and retrieval performance in Chinese IR
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1

A Markov random field model for term dependencies
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

HHMM-based Chinese lexical analyzer ICTCLAS
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17

On GMAP: and other transformations
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management

A Heuristic Approach for Segmentation Granularity Problem in Chinese Information Retrieval
ALPIT '07 Proceedings of the Sixth International Conference on Advanced Language Processing and Web Information Technology (ALPIT 2007)

Abstract

This paper presents a novel ranking-style word segmentation approach, RSVM-Seg, tailored to Chinese information retrieval (CIR). The approach makes segmentation decisions by ranking the internal associative strength between each pair of adjacent characters in a sentence. A ranking model is learned from a training corpus of query items with the widely used tool Ranking SVM, using statistical features such as mutual information, difference of t-test, frequency, and dictionary information. Experimental results show that this method eliminates overlapping ambiguity much more effectively than current word segmentation methods. Furthermore, because the strategy naturally generates segmentation results at different granularities, it improves the performance of CIR systems and achieves state-of-the-art results.
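
The boundary-ranking idea in the abstract can be illustrated with a small sketch. It is not the RSVM-Seg method itself: instead of a model learned with Ranking SVM over several features, it ranks the links between adjacent characters by a single score, pointwise mutual information with crude add-one smoothing. The function names, the toy corpus, and the `n_cuts` parameter are all hypothetical and chosen only for illustration. Cutting at more or fewer of the weakest links yields coarser or finer segmentations from the same ranking.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    return unigrams, bigrams


def pmi(pair, unigrams, bigrams):
    """Pointwise mutual information of an adjacent pair (add-one smoothing,
    an assumption made here so unseen pairs do not produce log(0))."""
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = (bigrams[pair] + 1) / (n_bi + 1)
    p_x = (unigrams[pair[0]] + 1) / (n_uni + 1)
    p_y = (unigrams[pair[1]] + 1) / (n_uni + 1)
    return math.log(p_xy / (p_x * p_y))


def segment(sentence, unigrams, bigrams, n_cuts):
    """Cut the sentence at the n_cuts weakest links between adjacent characters."""
    # scores[i] is the strength of the link between sentence[i] and sentence[i + 1].
    scores = [pmi(sentence[i:i + 2], unigrams, bigrams)
              for i in range(len(sentence) - 1)]
    # Rank links from weakest to strongest and keep the n_cuts weakest.
    weakest = sorted(range(len(scores)), key=lambda i: scores[i])[:n_cuts]
    cuts = sorted(i + 1 for i in weakest)
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(sentence[start:cut])
        start = cut
    pieces.append(sentence[start:])
    return pieces


# Toy "query log" standing in for the training corpus (hypothetical data).
queries = ["信息检索", "信息", "检索", "中文信息", "中文检索", "中文"]
uni, bi = train_counts(queries)
print(segment("中文信息检索", uni, bi, n_cuts=2))  # ['中文', '信息', '检索']
```

Raising `n_cuts` produces a finer-grained segmentation and lowering it a coarser one, which is one simple way to read the abstract's remark about results with different granularity.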