PM-based indexing for Chinese text retrieval

Authors:
Du Lin;Zhang Yibo;Sun Le;Sun Yufang;Han Jie
Affiliations:
Institute of Software, Chinese Academy of Sciences, Beijing, P.R.China;Institute of Software, Chinese Academy of Sciences, Beijing, P.R.China;Institute of Software, Chinese Academy of Sciences, Beijing, P.R.China;Institute of Software, Chinese Academy of Sciences, Beijing, P.R.China;Institute of Software, Chinese Academy of Sciences, Beijing, P.R.China
Venue:
IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages
Year:
2000

Abstract

This paper introduces a novel PM indexing scheme for Chinese text retrieval. Unlike Western languages, Chinese text has no delimiters between words, so indexing is based either on individual characters or on segmented words. With word-based indexing, out-of-vocabulary words such as proper nouns and domain terminology are often mis-segmented because segmentation dictionaries have limited vocabulary coverage, and this impairs query precision. Several indexing and ranking methods, including the novel PM-based ranking, were tested to compare their efficiency in dealing with new words in Chinese text retrieval. The experiments show that the query precision of the PM + word method is 10% higher than that of word indexing.
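
The contrast between character-based and word-based indexing, and the out-of-vocabulary problem described above, can be illustrated with a minimal sketch. It does not implement the PM method, which the abstract does not define. The segmenter is forward maximum matching, a common dictionary-based technique assumed here for illustration only; the vocabulary, the documents, and the names char_tokens, fmm_tokens, and build_index are all hypothetical.

```python
from collections import defaultdict


def char_tokens(text):
    """Character-based indexing: every character is an index term."""
    return list(text)


def fmm_tokens(text, vocab, max_len=4):
    """Word-based indexing via forward maximum matching (an assumed,
    illustrative segmenter): greedily take the longest dictionary word
    at each position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_index(docs, tokenize):
    """Build an inverted index mapping each term to a set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


# Hypothetical toy dictionary and documents; the proper noun "奥巴马"
# is deliberately missing from the dictionary.
VOCAB = {"中国", "科学院", "软件", "研究所"}
DOCS = {1: "中国科学院软件研究所", 2: "奥巴马访问中国"}

word_index = build_index(DOCS, lambda text: fmm_tokens(text, VOCAB))
char_index = build_index(DOCS, char_tokens)

# The out-of-vocabulary name is split into single characters, so it
# never becomes a term of the word index.
print(fmm_tokens(DOCS[2], VOCAB))
print("奥巴马" in word_index)
```

In this toy setup, a query for the name as a whole word finds nothing in the word index, while the character index still contains each of its characters. This is the kind of gap that motivates combining word indexing with another scheme, as the PM + word method in the abstract does.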