On the use of words and n-grams for Chinese information retrieval

  • Authors:
  • Jian-Yun Nie; Jiangfeng Gao; Jian Zhang; Ming Zhou

  • Affiliations:
  • Département d'informatique et de recherche opérationnelle, Université de Montréal; Microsoft Research; Microsoft Research; Microsoft Research

  • Venue:
  • IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages
  • Year:
  • 2000

Abstract

In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performance. In this study, we carry out further experiments on different ways to segment documents and queries and to combine words with n-grams. Our experiments show that a combination of the longest-matching algorithm with single characters is the best choice.
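
To make the indexing units named in the abstract concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the longest-matching segmenter scans the text or how words and characters are merged, so the sketch assumes a forward (left-to-right) scan and takes the union of the segmented words and the single characters. The function names, the `max_len` parameter, and the tiny lexicon are all hypothetical.

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Forward longest-matching segmentation (assumed scan direction).

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def char_ngrams(text, n=2):
    """Overlapping character n-grams (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_terms(text, lexicon, max_len=4):
    """Segmented words plus single characters, de-duplicated in order.

    Taking the plain union is an assumption made for this sketch.
    """
    words = longest_match_segment(text, lexicon, max_len)
    chars = list(text)
    return list(dict.fromkeys(words + chars))


# Hypothetical toy lexicon, for illustration only.
lexicon = {"中文", "信息", "检索", "信息检索"}
text = "中文信息检索"

words = longest_match_segment(text, lexicon)
bigrams = char_ngrams(text, 2)
terms = index_terms(text, lexicon)
```

With this toy lexicon, the segmenter prefers the long entry "信息检索" over its parts "信息" and "检索", so a query containing only "检索" would not match on words alone. Indexing the single characters alongside the words gives such a query something to match, which is one plausible reading of why the combination helps.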