Comparing different units for query translation in Chinese cross-language information retrieval

  • Authors: Lixin Shi; Jian-Yun Nie; Jing Bai

  • Affiliations: Université de Montréal, Montréal, Québec, Canada (all three authors)

  • Venue: Proceedings of the 2nd international conference on Scalable information systems
  • Year: 2007

Abstract

Although both words and character n-grams have been used in Chinese IR, they have usually been treated as two competing methods. For cross-language IR (CLIR) with Chinese, all previous studies have translated queries into words. In this paper, we re-examine the use of n-grams and words for monolingual Chinese IR. We show that both types of indexing unit can be combined within the language modeling framework to produce higher retrieval effectiveness. For CLIR with Chinese, we investigate the possibility of using bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams and words are created based on a parallel corpus. An English query is then translated in several ways, each producing a ranking score. The final ranking score combines the scores from all of these translations. Our experiments on several collections show that Chinese character n-grams are reasonable alternative translation units to words, and that they lead to retrieval effectiveness comparable to that of words. In addition, combinations of words and n-grams produce higher effectiveness.
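
The pipeline described in the abstract (translate an English query into several kinds of Chinese units, score documents once per unit type, then combine the scores) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the toy translation tables, the Dirichlet-style smoothing, and the linear mixture of per-unit scores. Word units are left out because they would require a Chinese word segmenter.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of a string, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def translate_query(english_words, table):
    """Turn an English query into weighted Chinese units.

    `table` maps an English word to {chinese_unit: p(unit | word)}.
    The tables used below are hypothetical toy values, not learned ones.
    """
    weights = Counter()
    for word in english_words:
        for unit, prob in table.get(word, {}).items():
            weights[unit] += prob / len(english_words)
    return dict(weights)


def lm_score(query_weights, doc_units, coll_counts, mu=2.0):
    """Weighted query log-likelihood under a smoothed document model.

    Dirichlet-style smoothing with add-one collection estimates is an
    assumed choice; the abstract does not name a smoothing method.
    """
    doc_counts = Counter(doc_units)
    coll_total = sum(coll_counts.values())
    vocab_size = len(coll_counts)
    score = 0.0
    for unit, weight in query_weights.items():
        p_coll = (coll_counts.get(unit, 0) + 1) / (coll_total + vocab_size + 1)
        p_doc = (doc_counts[unit] + mu * p_coll) / (len(doc_units) + mu)
        score += weight * math.log(p_doc)
    return score


def rank(english_query, docs, tables, mix, mu=2.0):
    """Rank documents by a linear mixture of per-unit-type scores.

    `tables` and `mix` are keyed by n-gram length (1 = unigram, 2 = bigram).
    The linear mixture is an assumption about how scores are combined.
    """
    totals = {doc_id: 0.0 for doc_id in docs}
    for n, table in tables.items():
        units = {doc_id: char_ngrams(text, n) for doc_id, text in docs.items()}
        coll_counts = Counter(u for doc_units in units.values() for u in doc_units)
        query_weights = translate_query(english_query, table)
        for doc_id in docs:
            totals[doc_id] += mix[n] * lm_score(
                query_weights, units[doc_id], coll_counts, mu
            )
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy data: two short documents and hand-written translation tables.
docs = {"d1": "中国经济发展", "d2": "足球比赛结果"}
tables = {
    1: {"economy": {"经": 0.5, "济": 0.5}},  # English word -> Chinese unigrams
    2: {"economy": {"经济": 1.0}},           # English word -> Chinese bigrams
}
mix = {1: 0.5, 2: 0.5}  # assumed equal weights for the two unit types

ranking = rank(["economy"], docs, tables, mix)
```

In this toy setting `ranking` lists `d1` first, because it contains the characters and the bigram that the hypothetical tables associate with "economy". In practice the translation probabilities would be estimated from a parallel corpus and the mixture weights tuned on held-out queries.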