On the use of words and n-grams for Chinese information retrieval
IRAL '00 Proceedings of the fifth international workshop on on Information retrieval with Asian languages
Using self-supervised word segmentation in Chinese information retrieval
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Optimizing search engines using clickthrough data
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A Chinese dictionary construction algorithm for information retrieval
ACM Transactions on Asian Language Information Processing (TALIP)
Chinese word segmentation and its effect on information retrieval
Information Processing and Management: an International Journal
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
A Markov random field model for term dependencies
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
HHMM-based Chinese lexical analyzer ICTCLAS
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
On GMAP: and other transformations
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
A Heuristic Approach for Segmentation Granularity Problem in Chinese Information Retrieval
ALPIT '07 Proceedings of the Sixth International Conference on Advanced Language Processing and Web Information Technology (ALPIT 2007)
Hi-index | 0.00 |
This paper presents a novel, ranking-style word segmentation approach, called RSVM-Seg, which is well tailored to Chinese information retrieval(CIR). This strategy makes segmentation decision based on the ranking of the internal associative strength between each pair of adjacent characters of the sentence. On the training corpus composed of query items, a ranking model is learned by a widely-used tool Ranking SVM, with some useful statistical features, such as mutual information, difference of t-test, frequency and dictionary information. Experimental results show that, this method is able to eliminate overlapping ambiguity much more effectively, compared to the current word segmentation methods. Furthermore, as this strategy naturally generates segmentation results with different granularity, the performance of CIR systems is improved and achieves the state of the art.