A comparison of Chinese document indexing strategies and retrieval models

  • Authors: Robert W. P. Luk; K. L. Kwok
  • Affiliations: The Hong Kong Polytechnic University, Hong Kong; Queens College, City University of New York
  • Venue: ACM Transactions on Asian Language Information Processing (TALIP)
  • Year: 2002

Abstract

With the advent of the Internet and intranets, substantial interest is being shown in Asian language information retrieval, especially in Chinese, which is a good example of an Asian ideographic language (other examples include Japanese and Korean). Because spaces do not delimit words in this type of language, an important issue is which index terms should be extracted from documents. This issue also has wider implications for indexing other languages, such as agglutinating languages (e.g., Finnish and Turkish) and archaic ideographic languages like Egyptian hieroglyphs, and for other types of information, such as data stored in genomic databases. Although comparisons of indexing strategies for Chinese documents have been made, almost all of them are based on a single retrieval model. This article compares the performance of various combinations of indexing strategies (i.e., character, word, short-word, bigram, and Pircs indexing) and retrieval models (i.e., vector space, 2-Poisson, logistic regression, and Pircs models). We determine which model (and its parameters) achieves the (near) best retrieval effectiveness without relevance feedback, and compare it with the open evaluations (i.e., TREC and NTCIR) for both long and title queries. In addition, we describe a more extensive investigation of retrieval efficiency. In particular, the storage cost of word indexing is only slightly higher than that of character indexing, whereas bigram indexing incurs about double the storage cost of the other indexing strategies. The retrieval time typically varies linearly with the number of unique terms in the query, which is supported by correlation values above 90%. The Pircs retrieval system achieves robust and good retrieval performance, but it appears to be the slowest method, whereas vector space models were not very effective in retrieval but were able to respond quickly.
For robust, near-best retrieval effectiveness, without considering storage overhead, the 2-Poisson model using bigram indexing appears to be a good compromise between retrieval effectiveness and efficiency for both long and title queries.
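
To make the difference between character and bigram indexing concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that index terms are drawn only from runs of characters in the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), that bigrams overlap, and that an isolated single character is kept as a term of its own; the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def is_cjk(ch):
    """Assumption: only the basic CJK Unified Ideographs block counts as Chinese text."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_runs(text):
    """Yield maximal runs of consecutive CJK characters, skipping everything else."""
    run = []
    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        elif run:
            yield "".join(run)
            run = []
    if run:
        yield "".join(run)


def character_terms(text):
    """Character indexing: every CJK character is an index term."""
    return [ch for run in cjk_runs(text) for ch in run]


def bigram_terms(text):
    """Bigram indexing: every overlapping pair of adjacent CJK characters is a term.

    Assumption: a run of length one is kept as a single-character term
    so that it is not lost from the index.
    """
    terms = []
    for run in cjk_runs(text):
        if len(run) == 1:
            terms.append(run)
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


if __name__ == "__main__":
    sample = "中文信息检索，检索"
    print(Counter(character_terms(sample)))
    print(Counter(bigram_terms(sample)))
```

Neither strategy needs a dictionary or a word segmenter, which is what distinguishes them from word and short-word indexing. The set of distinct bigrams in a collection is generally much larger than the set of distinct characters, which is consistent with the higher storage cost reported above for bigram indexing.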