Machine translation of Arabic dialects

  • Authors:
  • Rabih Zbib; Erika Malchiodi; Jacob Devlin; David Stallard; Spyros Matsoukas; Richard Schwartz; John Makhoul; Omar F. Zaidan; Chris Callison-Burch

  • Affiliations:
  • Raytheon BBN Technologies, Cambridge, MA; Raytheon BBN Technologies, Cambridge, MA; Raytheon BBN Technologies, Cambridge, MA; Raytheon BBN Technologies, Cambridge, MA; Raytheon BBN Technologies, Cambridge, MA; Raytheon BBN Technologies, Cambridge, MA; Raytheon BBN Technologies, Cambridge, MA; Microsoft Research, Redmond, WA; Johns Hopkins University, Baltimore, MD

  • Venue:
  • NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
  • Year:
  • 2012

Abstract

Arabic dialects present many challenges for machine translation, not least of which is the lack of data resources. We use crowdsourcing to cheaply and quickly build Levantine-English and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialectal sentences are selected from a large corpus of Arabic web text and translated using Amazon's Mechanical Turk. We use this data to build Dialectal Arabic MT systems, and find that small amounts of dialectal data have a dramatic impact on translation quality. When translating Egyptian and Levantine test sets, our Dialectal Arabic MT system scores 6.3 and 7.0 BLEU points higher, respectively, than a Modern Standard Arabic MT system trained on a 150M-word Arabic-English parallel corpus.
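
For readers unfamiliar with the metric behind the reported gains, the following is a minimal sketch of corpus-level BLEU on a 0-100 scale. It is a simplified illustration and not the authors' evaluation setup: it assumes a single reference per sentence and whitespace tokenization, and the function names `ngrams` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis.

    Assumptions (not from the paper): whitespace tokenization, a single
    reference per sentence, and no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Under these assumptions, a difference of several BLEU points between two systems would be computed by calling `corpus_bleu` on each system's output against the same references and subtracting the two scores.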