Detecting highly confident word translations from comparable corpora without any prior knowledge

  • Authors:
  • Ivan Vulić;Marie-Francine Moens

  • Affiliations:
  • KU Leuven Celestijnenlaan, Leuven, Belgium;KU Leuven Celestijnenlaan, Leuven, Belgium

  • Venue:
  • EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we extend the work on using latent cross-language topic models for identifying word translations across comparable corpora. We present a novel precision-oriented algorithm that relies on per-topic word distributions obtained by the bilingual LDA (BiLDA) latent topic model. The algorithm aims at harvesting only the most probable word translations across languages in a greedy fashion, without any prior knowledge about the language pair, relying on a symmetrization process and the one-to-one constraint. We report our results for Italian-English and Dutch-English language pairs that outperform the current state-of-the-art results by a significant margin. In addition, we show how to use the algorithm for the construction of high-quality initial seed lexicons of translations.