Automatic segmentation of bilingual corpora: a comparison of different techniques

  • Authors:
  • Ismael García Varea; Daniel Ortiz; Francisco Nevado; Pedro A. Gómez; Francisco Casacuberta

  • Affiliations:
  • Dpto. de Informática, Universidad de Castilla-La Mancha, Albacete, Spain; Dpto. de Sistemas Informáticos y Computación, Instituto Tecnológico de Informática, Univ. Politécnica de Valencia, Valencia, Spain; Dpto. de Sistemas Informáticos y Computación, Instituto Tecnológico de Informática, Univ. Politécnica de Valencia, Valencia, Spain; Dpto. de Informática, Universidad de Castilla-La Mancha, Albacete, Spain; Dpto. de Sistemas Informáticos y Computación, Instituto Tecnológico de Informática, Univ. Politécnica de Valencia, Valencia, Spain

  • Venue:
  • IbPRIA'05 Proceedings of the Second Iberian conference on Pattern Recognition and Image Analysis - Volume Part II
  • Year:
  • 2005

Abstract

Segmentation of bilingual text corpora is an important issue in machine translation. In this paper we present a new method for bilingual segmentation of a parallel corpus, SPBalign, which is based on phrase-based statistical translation models. The proposed technique is compared with two existing techniques that are also based on statistical translation methods: RECalign, which is based on the concept of recursive alignment, and GIATIalign, which is based on simple word alignments. Experimental results are obtained for the EuTrans-I English-Spanish task, with the aim of creating new, shorter bilingual segments to be included in a translation memory database. The three methods are evaluated by comparing the bilingual segmentations they produce against a manually segmented bilingual test corpus. The results show that the new method outperforms the two previously proposed bilingual segmentation techniques in all cases.
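
As an illustration of what comparing an automatic bilingual segmentation against a manual one can look like, the following minimal sketch scores a hypothesized segmentation by exact-match precision, recall and F-measure over bilingual segments. The abstract does not state which measure the paper uses, so the metric, the span representation and the name `segmentation_scores` are assumptions made only for this example.

```python
from typing import List, Tuple

# Assumed representation (not taken from the paper): a bilingual segment
# pairs a source span with a target span, each a half-open range of word
# indices [start, end).
Span = Tuple[int, int]
BiSegment = Tuple[Span, Span]


def segmentation_scores(
    hypothesis: List[BiSegment], reference: List[BiSegment]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) over exactly matching segments."""
    hyp_set, ref_set = set(hypothesis), set(reference)
    matched = len(hyp_set & ref_set)
    precision = matched / len(hyp_set) if hyp_set else 0.0
    recall = matched / len(ref_set) if ref_set else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical sentence pair: 5 source words, 6 target words.
reference = [((0, 2), (0, 2)), ((2, 5), (2, 6))]
hypothesis = [((0, 2), (0, 2)), ((2, 4), (2, 4)), ((4, 5), (4, 6))]

precision, recall, f_measure = segmentation_scores(hypothesis, reference)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

Exact matching is deliberately strict: a hypothesis that splits one reference segment into two gets no credit for either piece. A boundary-based measure would be more lenient, and which of the two is closer to the paper's evaluation cannot be determined from the abstract.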