Data cleaning for word alignment

  • Authors: Tsuyoshi Okita
  • Affiliations: Dublin City University, Glasnevin, Dublin
  • Venue: ACLstudent '09 Proceedings of the ACL-IJCNLP 2009 Student Research Workshop
  • Year: 2009

Abstract

Parallel corpora are created by human beings. An MT system, however, is an aggregation of state-of-the-art NLP technologies that runs without human intervention, so it is unavoidable that quite a few sentence pairs lie beyond its analysis and therefore do not contribute to the system. Worse, such pairs may work against our objectives and degrade overall performance. Likely unfavorable items are n:m mapping objects, such as paraphrases, non-literal translations, and multiword expressions. This paper presents a pre-processing method that detects such unfavorable items before they are supplied to the word aligner, under the assumption that their frequency is low, for example below 5 percent. We show an improvement in BLEU score from 28.0 to 31.4 for English-Spanish and from 16.9 to 22.1 for German-English.
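The pre-processing idea, removing a small fraction of sentence pairs before word alignment, can be illustrated with a minimal sketch. The abstract does not state how unfavorable pairs are detected, so the scoring function here (`length_ratio_score`, a token-count ratio) is a stand-in assumption and not the paper's criterion. The names `filter_pairs` and `drop_fraction` are likewise hypothetical; only the default of 5 percent is taken from the abstract.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def length_ratio_score(pair: SentencePair) -> float:
    """Stand-in score (an assumption, not the paper's detector).

    Returns 1.0 when both sides have the same number of whitespace
    tokens and approaches 0.0 as the lengths diverge.
    """
    src_len = len(pair[0].split())
    tgt_len = len(pair[1].split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def filter_pairs(
    pairs: List[SentencePair],
    score: Callable[[SentencePair], float] = length_ratio_score,
    drop_fraction: float = 0.05,
) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Split a corpus into (kept, dropped) before word alignment.

    The lowest-scoring `drop_fraction` of pairs is dropped, reflecting
    the assumption that unfavorable pairs are rare.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n_drop = int(len(pairs) * drop_fraction)
    # Rank pair indices from lowest to highest score; sorted() is stable,
    # so ties keep their original corpus order.
    ranked = sorted(range(len(pairs)), key=lambda i: score(pairs[i]))
    dropped_idx = set(ranked[:n_drop])
    kept = [p for i, p in enumerate(pairs) if i not in dropped_idx]
    dropped = [pairs[i] for i in sorted(dropped_idx)]
    return kept, dropped
```

Only the `kept` list would be passed on to the word aligner; the `dropped` list is returned so that the removed pairs can be inspected. Any other detector can be plugged in through the `score` parameter.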