Bitext alignment for statistical machine translation

  • Authors: William Byrne; Yonggang Deng

  • Affiliations: The Johns Hopkins University; The Johns Hopkins University

  • Venue: Bitext alignment for statistical machine translation
  • Year: 2006

Abstract

Bitext alignment is the task of finding translation equivalence between documents in two languages, collections of which are commonly known as bitext. This dissertation addresses the problems of statistical alignment at various granularities, from sentence to word, with the goal of creating Statistical Machine Translation (SMT) systems. SMT systems are statistical pattern processors based on parameterized models estimated from aligned bitext training collections. The collections are large enough that alignments must be created using automatic methods. The bitext collections are often available as aligned documents, such as news stories, which usually need to be further aligned at the sentence level and the word level before statistics can be extracted from the bitext. We develop statistical models that are learned from data in an unsupervised way. Language-independent alignment algorithms are derived for efficiency and effectiveness. We first address the problem of extracting bitext chunk pairs, which are translation segments at the sentence or sub-sentence level. To extract these bitext chunk pairs, we formulate a model of translation as a stochastic generative model over parallel documents, and derive several different alignment procedures through various formulations of the component distributions. Based on these models we propose a hierarchical chunking procedure that produces chunk pairs by a series of alignment operations in which coarse alignment of large sections of text is followed by a more detailed alignment of their subsections. We show practical benefits with this chunking scheme, observing in particular that it makes efficient use of bitext by aligning sections of text that simpler procedures would discard as spurious.
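As a rough illustration of what a single monotone alignment pass over text segments can look like, the following is a minimal sketch of length-based alignment by dynamic programming. It is not the generative model of the dissertation: the function name `align_by_length`, the cost function, the set of allowed segment groupings, and the penalty values are all assumptions chosen for illustration. Segment lengths are assumed to be positive.

```python
def align_by_length(src_lens, tgt_lens, skip_penalty=3.0, merge_penalty=0.5):
    """Monotone alignment of two sequences of segment lengths.

    Hypothetical sketch: the cost of pairing segments is their relative
    length difference; costs and penalties are illustrative assumptions.
    Returns a list of (source_indices, target_indices) pairs.
    """
    # Allowed groupings: 1-1, deletion, insertion, 2-1 and 1-2 merges.
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    # One side is left unaligned.
                    step = skip_penalty
                else:
                    s = sum(src_lens[i:ni])
                    t = sum(tgt_lens[j:nj])
                    # Relative length mismatch plus a penalty for merging.
                    step = abs(s - t) / (s + t) + merge_penalty * (di + dj - 2)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Trace the best path back from the end of both sequences.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    path.reverse()
    return path


if __name__ == "__main__":
    print(align_by_length([10, 10, 20], [20, 20]))
```

One way to read the coarse-to-fine idea with this sketch is to run it first on paragraph lengths and then again on the sentence lengths inside each aligned paragraph pair; that two-stage use is an illustration only, not the procedure described above.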
For the problem of word alignment in bitext, we propose a novel Hidden Markov Model-based Word-to-Phrase (WtoP) alignment model, which is formulated so that alignment and parameter estimation can be performed efficiently using standard HMM algorithms. We find that the word alignment performance of the WtoP model is comparable to that of IBM Model-4, currently considered the state of the art, even in processing large bitext collections. We use this Word-to-Phrase model to define a posterior distribution over translation phrase pairs in the bitext, and develop a phrase-pair extraction procedure based on this posterior distribution. We show that this use of the phrase translation posterior distribution allows us to extract a richer inventory of phrases than results from current techniques. In the evaluation of large Chinese-English SMT systems, we find that systems derived from word-aligned bitext created using the WtoP model perform comparably to systems derived from Model-4 word alignments, and in Arabic-English we find significant gains from using WtoP alignments.
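To make the phrase "standard HMM algorithms" concrete, here is a minimal sketch of Viterbi decoding for a simpler word-to-word HMM alignment model, in which each target word is emitted by one source position and transitions depend only on the jump between positions. This is not the Word-to-Phrase model itself. The function name `viterbi_align`, the toy probability tables, and the flooring constant are assumptions for illustration, and both sentences are assumed to be non-empty.

```python
import math


def viterbi_align(src, tgt, t_prob, jump_prob, floor=1e-6):
    """Most likely source position for each target word (word-to-word HMM).

    Hypothetical sketch: t_prob maps (source_word, target_word) to a
    translation probability, jump_prob maps a position difference to a
    transition probability; unseen entries fall back to `floor`.
    """
    n_src, n_tgt = len(src), len(tgt)

    def emit(i, j):
        return math.log(t_prob.get((src[i], tgt[j]), floor))

    def trans(i_prev, i):
        return math.log(jump_prob.get(i - i_prev, floor))

    delta = [[-math.inf] * n_src for _ in range(n_tgt)]
    back = [[0] * n_src for _ in range(n_tgt)]

    # Uniform distribution over the starting source position.
    for i in range(n_src):
        delta[0][i] = -math.log(n_src) + emit(i, 0)

    for j in range(1, n_tgt):
        for i in range(n_src):
            best = max(range(n_src), key=lambda k: delta[j - 1][k] + trans(k, i))
            delta[j][i] = delta[j - 1][best] + trans(best, i) + emit(i, j)
            back[j][i] = best

    # Backtrack from the best final state.
    alignment = [0] * n_tgt
    alignment[-1] = max(range(n_src), key=lambda i: delta[-1][i])
    for j in range(n_tgt - 1, 0, -1):
        alignment[j - 1] = back[j][alignment[j]]
    return alignment


if __name__ == "__main__":
    # Toy tables, invented for this example only.
    t_prob = {("la", "the"): 0.9, ("maison", "house"): 0.8, ("bleue", "blue"): 0.8}
    jump_prob = {0: 0.2, 1: 0.4, -1: 0.2, 2: 0.15, -2: 0.05}
    print(viterbi_align(["la", "maison", "bleue"], ["the", "blue", "house"],
                        t_prob, jump_prob))
```

Working in log space avoids numerical underflow on long sentences. In an unsupervised setting the two tables would be estimated from the bitext rather than written by hand.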