Cross-language information retrieval with latent topic models trained on a comparable corpus

  • Authors:
  • Ivan Vulić; Wim De Smet; Marie-Francine Moens

  • Affiliations:
  • Department of Computer Science, K.U. Leuven, Belgium; Department of Computer Science, K.U. Leuven, Belgium; Department of Computer Science, K.U. Leuven, Belgium

  • Venue:
  • AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
  • Year:
  • 2011

Abstract

In this paper, we study cross-language information retrieval using a bilingual topic model trained on comparable corpora such as Wikipedia articles. The bilingual Latent Dirichlet Allocation model (BiLDA) creates an interlingual representation that can serve as a translation resource in many multilingual settings, because comparable corpora are available for many language pairs. This probabilistic interlingual representation is incorporated into a statistical language model for information retrieval. Experiments on the English and Dutch test datasets of the CLEF 2001-2003 CLIR campaigns show that our approach performs competitively with cross-language retrieval methods that rely on pre-existing translation dictionaries, whether hand-built or constructed from parallel corpora.
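
The abstract does not give the retrieval formula, so the following is only a minimal sketch of one plausible way a shared topic space could be plugged into a query-likelihood language model. It assumes that each target-language document d has a topic distribution P(z_k | d), that each topic has a word distribution P(w | z_k) over the query language, and that a query is scored as the product over its words of sum_k P(w | z_k) P(z_k | d). The names `query_log_likelihood`, `rank_documents`, `PHI_EN`, and `THETA_NL`, and all numbers, are hypothetical toy values, not taken from the paper.

```python
import math


def query_log_likelihood(query_word_ids, phi_query_lang, theta_doc):
    """Return log P(query | doc) under the assumed topic-based model.

    phi_query_lang[k][w] : assumed P(word w | topic k) in the query language
    theta_doc[k]         : assumed P(topic k | document), document in the other language
    """
    log_p = 0.0
    for w in query_word_ids:
        # Marginalize over the shared topics to get P(w | doc).
        p_w = sum(phi_query_lang[k][w] * theta_doc[k] for k in range(len(theta_doc)))
        # No smoothing here: a real system would need it to avoid log(0).
        log_p += math.log(p_w)
    return log_p


def rank_documents(query_word_ids, phi_query_lang, theta):
    """Rank document ids by descending query log-likelihood."""
    scores = {
        doc_id: query_log_likelihood(query_word_ids, phi_query_lang, theta_doc)
        for doc_id, theta_doc in theta.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy model: 2 shared topics, 3 English query words, 2 Dutch documents.
PHI_EN = [
    [0.7, 0.2, 0.1],  # topic 0 favors word 0
    [0.1, 0.2, 0.7],  # topic 1 favors word 2
]
THETA_NL = {
    "doc_a": [0.9, 0.1],  # mostly topic 0
    "doc_b": [0.2, 0.8],  # mostly topic 1
}

print(rank_documents([0], PHI_EN, THETA_NL))
print(rank_documents([2], PHI_EN, THETA_NL))
```

In this sketch no dictionary lookup occurs: the English query and the Dutch documents meet only in the topic space, which is what makes the representation interlingual. A practical system would likely add smoothing with a background model, which is omitted here.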