Finding the better indexing units for Chinese information retrieval

Authors:
Hongzhao He;Pilian He;Jianfeng Gao;Changning Huang
Affiliations:
Tianjin University, Tianjin, China;Tianjin University, Tianjin, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China
Venue:
SIGHAN '02 Proceedings of the first SIGHAN workshop on Chinese language processing - Volume 18
Year:
2002

Abstract

In processing Chinese documents and queries for information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performance. In this study, we carried out further experiments to find a better way to index Chinese texts. First, we investigated the impact of word segmentation accuracy on IR performance. Second, we examined in detail fifteen different groups of indexing units, formed from the possible combinations of words and character n-grams. The experiments showed that better segmentation results in better IR performance, and that a combination of words with uni-grams is the better choice for indexing Chinese texts for IR.
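
As an illustration only, the idea of combining words with character n-grams as indexing units can be sketched in Python. This is not the authors' implementation: the function names `char_ngrams` and `index_units`, the parameter names, and the two-word sample segmentation are all hypothetical, and the word segmenter itself is assumed to exist elsewhere.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_units(segmented_words, ngram_sizes=(1,), include_words=True):
    """Build the indexing units for one document or query.

    segmented_words: output of some word segmenter (assumed to be given).
    ngram_sizes: character n-gram lengths to add alongside the words.
    include_words: set to False for a purely n-gram-based index.
    """
    units = list(segmented_words) if include_words else []
    # n-grams are taken over the unsegmented character string.
    text = "".join(segmented_words)
    for n in ngram_sizes:
        units.extend(char_ngrams(text, n))
    return units


# Hypothetical segmentation of the string 信息检索 ("information retrieval").
words = ["信息", "检索"]

# Words combined with uni-grams, the combination the abstract favors.
units = index_units(words, ngram_sizes=(1,))
print(units)  # ['信息', '检索', '信', '息', '检', '索']
```

Passing other values for `ngram_sizes` and `include_words` yields other groups of indexing units, for example bi-grams only with `index_units(words, ngram_sizes=(2,), include_words=False)`.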