Arabic preprocessing schemes for statistical machine translation

  • Authors:
  • Nizar Habash;Fatiha Sadat

  • Affiliations:
  • Columbia University;National Research Council of Canada

  • Venue:
  • NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like to-kenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.