Identifying word translations from comparable corpora using latent topic models

  • Authors:
  • Ivan Vulić;Wim De Smet;Marie-Francine Moens

  • Affiliations:
  • K.U. Leuven, Celestijnenlaan, Leuven, Belgium;K.U. Leuven, Celestijnenlaan, Leuven, Belgium;K.U. Leuven, Celestijnenlaan, Leuven, Belgium

  • Venue:
  • HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
  • Year:
  • 2011

Abstract

A topic model outputs a multinomial distribution over words for each topic. In this paper, we investigate the value of bilingual topic models, specifically a bilingual Latent Dirichlet Allocation model, for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods, which use only knowledge from word-topic distributions, outperform methods based on similarity measures in the original word-document space. The best results are obtained by combining knowledge from word-topic distributions with similarity measures in the original space, and these are also reported.
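
As a rough illustration of how word-topic distributions might be used to rank translation candidates, the sketch below compares words across languages by their distributions over shared topics. It is not the authors' method: the cosine measure, the uniform topic prior, the function names, and the toy probabilities are all assumptions made for this example.

```python
import math

# Toy per-topic word distributions (rows = shared topics, columns = words).
# All values are invented for illustration; each row sums to 1.
src_vocab = ["dog", "bank"]
tgt_vocab = ["cane", "banca"]
src_topic_word = [[0.9, 0.1],
                  [0.2, 0.8]]
tgt_topic_word = [[0.8, 0.2],
                  [0.1, 0.9]]


def topic_profile(word_index, topic_word):
    """Return P(topic | word), assuming a uniform prior over topics."""
    column = [row[word_index] for row in topic_word]
    total = sum(column)
    return [p / total for p in column]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_translations(source_word):
    """Rank target words by similarity of their topic profiles to the source word's."""
    src_profile = topic_profile(src_vocab.index(source_word), src_topic_word)
    scored = [
        (cosine(src_profile, topic_profile(j, tgt_topic_word)), word)
        for j, word in enumerate(tgt_vocab)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored]


print(rank_translations("dog"))
print(rank_translations("bank"))
```

In a realistic setting the two matrices would come from a bilingual topic model trained on the document-aligned corpus, and the top-ranked target word would be taken as the proposed translation.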