An Expectation Maximization algorithm for textual unit alignment

  • Authors:
  • Radu Ion; Alexandru Ceauşu; Elena Irimia

  • Affiliations:
  • Research Institute for AI, Bucharest, Romania; Dublin City University, Glasnevin, Dublin, Ireland; Research Institute for AI, Calea Septembrie nr., Bucharest, Romania

  • Venue:
  • BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
  • Year:
  • 2011

Abstract

The paper presents an Expectation Maximization (EM) algorithm for the automatic generation of parallel and quasi-parallel data from comparable corpora of any degree of comparability, from parallel to weakly comparable. Specifically, we address the problem of extracting related textual units (documents, paragraphs or sentences), relying on the hypothesis that, in a given corpus, certain pairs of translation equivalents are better indicators of a correct textual unit correspondence than others. We evaluate our method on mixed types of bilingual comparable corpora in six language pairs, obtaining state-of-the-art accuracy figures.
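
The hypothesis in the abstract can be illustrated with a toy, EM-style loop. This is only a sketch under stated assumptions and not the algorithm from the paper: the function names (`matched_pairs`, `em_style_align`), the scoring rule, the weight update, and the English-French toy data are all hypothetical. It alternates a soft alignment step (each source unit distributes probability over candidate target units) with a re-estimation step (each translation pair is weighted by how much of its co-occurrence falls on likely alignments). The update is a heuristic and is not derived from an explicit likelihood.

```python
from collections import defaultdict


def matched_pairs(src_unit, tgt_unit, dictionary):
    """Dictionary pairs (s, t) with s in the source unit and t in the target unit."""
    src, tgt = set(src_unit), set(tgt_unit)
    return {(s, t) for (s, t) in dictionary if s in src and t in tgt}


def em_style_align(src_units, tgt_units, dictionary, iterations=10):
    """Toy EM-style alignment; assumes non-empty unit lists and iterations >= 1."""
    n_src, n_tgt = len(src_units), len(tgt_units)

    # Which dictionary pairs occur in each candidate unit pair (i, j).
    fired = {
        (i, j): matched_pairs(src_units[i], tgt_units[j], dictionary)
        for i in range(n_src)
        for j in range(n_tgt)
    }

    # In how many candidate unit pairs each dictionary pair occurs.
    support = defaultdict(int)
    for pairs in fired.values():
        for p in pairs:
            support[p] += 1

    weight = {p: 1.0 for p in support}  # uniform starting weights
    posterior = {}

    for _ in range(iterations):
        # Expectation-like step: soft alignment of every source unit.
        expected = defaultdict(float)
        for i in range(n_src):
            scores = [sum(weight[p] for p in fired[(i, j)]) for j in range(n_tgt)]
            total = sum(scores)
            for j, score in enumerate(scores):
                posterior[(i, j)] = score / total if total > 0 else 0.0
                for p in fired[(i, j)]:
                    expected[p] += posterior[(i, j)]

        # Maximization-like step (heuristic): share of a pair's occurrences
        # that fall on probable alignments.
        weight = {p: expected[p] / support[p] for p in support}

    # Hard decision: most probable target unit for each source unit.
    best = [
        max(range(n_tgt), key=lambda j: posterior[(i, j)])
        for i in range(n_src)
    ]
    return best, weight


# Hypothetical toy data: target units are shuffled relative to the source.
src_units = [["the", "cat", "sleeps"], ["the", "dog", "barks"], ["the", "bird", "sings"]]
tgt_units = [["le", "chien", "aboie"], ["le", "oiseau", "chante"], ["le", "chat", "dort"]]
dictionary = {("the", "le"), ("cat", "chat"), ("dog", "chien"), ("bird", "oiseau")}

best, weights = em_style_align(src_units, tgt_units, dictionary)
print(best)  # [2, 0, 1]
```

On this toy data the pair ("the", "le") occurs in every candidate unit pair and ends up with a lower weight than the content-word pairs, which each occur only in the correct unit pair. A source unit with no matching dictionary pair gets all-zero posteriors and falls back to target index 0, so a real system would need a threshold or a null alignment.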