An empirical study in source word deletion for phrase-based statistical machine translation

  • Authors: Chi-Ho Li, Dongdong Zhang, Mu Li, Ming Zhou, Hailei Zhang
  • Affiliations: Microsoft Research Asia, Beijing, China (Chi-Ho Li, Dongdong Zhang, Mu Li, Ming Zhou); Northeastern University of China, Shenyang, China (Hailei Zhang)
  • Venue: StatMT '08: Proceedings of the Third Workshop on Statistical Machine Translation
  • Year: 2008

Abstract

The treatment of 'spurious' source-language words is an important problem that is often ignored in discussions of phrase-based SMT. This paper explains why the problem is important and why it is not trivial, and proposes three models for handling spurious source words. Experiments show that each source word deletion model improves a phrase-based system by at least 1.6 BLEU points, and that the most sophisticated model improves it by nearly 2 BLEU points. The paper also explores how training data size and training data domain/genre affect source word deletion.