Automatic filtering of bilingual corpora for statistical machine translation

  • Authors:
  • Shahram Khadivi; Hermann Ney

  • Affiliations:
  • Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen, Germany; Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen, Germany

  • Venue:
  • NLDB'05 Proceedings of the 10th international conference on Natural Language Processing and Information Systems
  • Year:
  • 2005

Abstract

For many applications, such as machine translation and bilingual information retrieval, bilingual corpora play an important role in training the system. Because these corpora are obtained through automatic or semi-automatic methods, they usually contain noise: sentence pairs that are worthless or even harmful for training the system. We study the effect of different levels of corpus noise on an end-to-end statistical machine translation system. We also propose an efficient method for corpus filtering, which removes the noisy part of a corpus using state-of-the-art word alignment models. We show the efficiency of this method on the basis of the sentence misalignment rate of the filtered corpus and its positive effect on translation quality.
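
The abstract does not say which word alignment models are used or how their scores are turned into a filtering decision. As a rough illustration of the general idea only, the sketch below scores each sentence pair with a length-normalized IBM Model 1 log-probability and keeps pairs above a threshold. The choice of Model 1, the toy lexicon `T_TABLE`, the `floor` smoothing value, and the threshold are all assumptions made for this example, not details taken from the paper.

```python
import math

def model1_log_score(src_tokens, tgt_tokens, t_table, floor=1e-6):
    """Length-normalized IBM Model 1 log-probability of tgt given src.

    t_table maps (target_word, source_word) -> lexical probability.
    Unseen word pairs fall back to `floor` (an assumed smoothing value).
    """
    if not tgt_tokens:
        return float("-inf")
    src_with_null = ["<NULL>"] + list(src_tokens)
    total = 0.0
    for f in tgt_tokens:
        # Model 1 averages the lexical probabilities over all source
        # positions, including the empty (NULL) word.
        p = sum(t_table.get((f, e), floor) for e in src_with_null)
        total += math.log(p / len(src_with_null))
    # Dividing by the target length keeps long sentences comparable to short ones.
    return total / len(tgt_tokens)

def filter_corpus(pairs, t_table, threshold):
    """Keep only sentence pairs whose score reaches the threshold."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if model1_log_score(src.split(), tgt.split(), t_table) >= threshold
    ]

# Hypothetical toy lexicon; in practice it would be estimated from the corpus.
T_TABLE = {
    ("das", "the"): 0.9,
    ("haus", "house"): 0.9,
    ("ist", "is"): 0.9,
    ("klein", "small"): 0.9,
}

PAIRS = [
    ("the house is small", "das haus ist klein"),  # correctly aligned pair
    ("the house is small", "vielen dank"),         # misaligned (noisy) pair
]

kept = filter_corpus(PAIRS, T_TABLE, threshold=-5.0)
```

With this toy lexicon the correctly aligned pair is kept and the misaligned one is dropped. In a realistic setting the threshold would need to be tuned, for example against a manually checked sample from which a sentence misalignment rate can be estimated.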