Maximum-entropy word alignment and posterior-based phrase extraction for machine translation

  • Authors:
  • Nadi Tomeh;Alexandre Allauzen;François Yvon

  • Affiliations:
  • LIMSI---CNRS, Université Paris-Sud 11, Orsay, France 91400;LIMSI---CNRS, Université Paris-Sud 11, Orsay, France 91400;LIMSI---CNRS, Université Paris-Sud 11, Orsay, France 91400

  • Venue:
  • Machine Translation
  • Year:
  • 2014

Abstract

One of the fundamental assumptions in statistical machine translation (SMT) is that the correspondence between a sentence and its translation can be explained in terms of an alignment between their words. Such alignment information is typically not observed in the parallel corpora used to build the phrase table of an SMT system. It is therefore customary to estimate a probabilistic model of the assumed hidden word alignment, which is then used to extract bilingual phrase pairs. In standard extraction heuristics, the alignment model is under-exploited, as the only information used from the posterior distribution is the Viterbi best alignment. This is due to the high computational complexity of the IBM models, which are the de facto standard for computing these alignments. These models have other limitations as well, including their asymmetry and their inability to integrate rich, feature-based descriptions. We argue that refining the word alignment model in a discriminative maximum-entropy framework substantially improves the alignment quality. We also show that these improved alignments, combined with efficient and accurate computation of the link posterior distributions, can improve the overall translation performance, especially when posterior-based extraction methods are applied.
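
To make the contrast between a single best alignment and link posteriors concrete, the following is a minimal sketch and not the authors' model. It assumes each candidate link (i, j) is scored independently by a binary maximum-entropy (logistic) classifier, and that links are kept when their posterior exceeds a threshold. The feature names, the weights, the toy lexicon, and the sentence pair are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy lexicon and hand-set weights (illustrative assumptions only).
TOY_LEXICON = {("la", "the"), ("maison", "house"), ("bleue", "blue")}
WEIGHTS = {"bias": -2.0, "in_lexicon": 5.0, "position_gap": -1.0}


def toy_features(i, j, src, tgt):
    """Hypothetical feature function for the candidate link (src[i], tgt[j])."""
    return {
        "bias": 1.0,
        "in_lexicon": 1.0 if (src[i], tgt[j]) in TOY_LEXICON else 0.0,
        "position_gap": abs(i / len(src) - j / len(tgt)),
    }


def link_posterior(weights, features):
    """Binary maximum-entropy (logistic) model: P(link is active | features)."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def posterior_matrix(src, tgt, weights, featurize):
    """Posterior probability of every candidate link, as a len(src) x len(tgt) matrix."""
    return [
        [link_posterior(weights, featurize(i, j, src, tgt)) for j in range(len(tgt))]
        for i in range(len(src))
    ]


def threshold_links(post, threshold=0.5):
    """Keep every link whose posterior reaches the threshold."""
    return {
        (i, j)
        for i, row in enumerate(post)
        for j, p in enumerate(row)
        if p >= threshold
    }


src = ["la", "maison", "bleue"]
tgt = ["the", "blue", "house"]
post = posterior_matrix(src, tgt, WEIGHTS, toy_features)

strict_links = threshold_links(post, threshold=0.5)  # high-confidence links only
loose_links = threshold_links(post, threshold=0.1)   # admits lower-confidence links
```

Because the posteriors are retained rather than collapsed into one alignment, the threshold becomes a tunable trade-off: lowering it admits additional, less certain links that a single best alignment would discard. A posterior-based phrase extractor could use such scores to weight or filter candidate phrase pairs, though how the paper does this is not specified in the abstract.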