Leveraging Arabic-English bilingual corpora with crowd sourcing-based annotation for Arabic-Hebrew SMT

Authors:
Manish Gaurav;Guruprasad Saikumar;Amit Srivastava;Premkumar Natarajan;Shankar Ananthakrishnan;Spyros Matsoukas
Affiliations:
Raytheon BBN Technologies, Cambridge, MA;Raytheon BBN Technologies, Cambridge, MA;Raytheon BBN Technologies, Cambridge, MA;Raytheon BBN Technologies, Cambridge, MA;Raytheon BBN Technologies, Cambridge, MA;Raytheon BBN Technologies, Cambridge, MA
Venue:
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2
Year:
2013

Citing 10
Cited 0

A systematic comparison of various statistical alignment models
Computational Linguistics

The Web as a parallel corpus
Computational Linguistics - Special issue on web as corpus

BLEU: a method for automatic evaluation of machine translation
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics

Statistical phrase-based translation
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Reliable measures for aligning Japanese-English news articles and sentences
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora
Computational Linguistics

Improved statistical machine translation using paraphrases
HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics

Revisiting pivot language approach for machine translation
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1

Can crowds build parallel corpora for machine translation systems?
CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk

Machine translation of Arabic dialects
NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Abstract

Recent studies in the Statistical Machine Translation (SMT) paradigm have focused on developing foreign-language-to-English translation systems. However, as SMT systems have matured, demand has grown for translation from one foreign language to another. Unfortunately, parallel training corpora for a pair of morphologically complex foreign languages such as Arabic and Hebrew are very scarce. This paper uses active-learning-based data selection and a crowd sourcing platform, Amazon Mechanical Turk, to create Arabic-Hebrew parallel corpora. It then explores two techniques for building an Arabic-Hebrew SMT system. The first is the traditional cascading of two SMT systems using English as a pivot language. The second trains a direct Arabic-Hebrew SMT system using sentence pivoting. Finally, we use a phrase generalization approach to further improve performance.
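
The cascading approach mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the function names, the n-best size, the toy lexicons, and the choice to score a target hypothesis by the best single pivot path (sum of the two log-probabilities) are all assumptions made for illustration, and the abstract does not say how the paper combines scores.

```python
from typing import Callable, Dict, List, Tuple

# A hypothesis is a (translation, log-probability) pair.
Hypothesis = Tuple[str, float]
# A translator maps a sentence to an n-best list, best first.
Translator = Callable[[str], List[Hypothesis]]


def cascade(src_to_pivot: Translator,
            pivot_to_tgt: Translator,
            sentence: str,
            n_best: int = 3) -> List[Hypothesis]:
    """Translate source -> pivot -> target by chaining two systems.

    Assumption: each target hypothesis is scored by its best pivot
    path, i.e. the maximum over pivots of the summed log-probabilities.
    """
    best: Dict[str, float] = {}
    for pivot, lp_pivot in src_to_pivot(sentence)[:n_best]:
        for target, lp_target in pivot_to_tgt(pivot)[:n_best]:
            score = lp_pivot + lp_target
            if score > best.get(target, float("-inf")):
                best[target] = score
    # Highest-scoring target hypothesis first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for trained Arabic-English and English-Hebrew systems.
# The transliterated entries and scores are invented for illustration.
def toy_ar_en(sentence: str) -> List[Hypothesis]:
    table = {"kitab": [("book", -0.2), ("letter", -1.8)]}
    return table.get(sentence, [])


def toy_en_he(sentence: str) -> List[Hypothesis]:
    table = {
        "book": [("sefer", -0.1)],
        "letter": [("mikhtav", -0.3), ("ot", -1.5)],
    }
    return table.get(sentence, [])


if __name__ == "__main__":
    for target, score in cascade(toy_ar_en, toy_en_he, "kitab"):
        print(f"{target}\t{score:.2f}")
```

Passing more than one pivot hypothesis forward, rather than only the single best English output, gives the second system a chance to recover from an error made by the first. A real cascade would replace the toy functions with calls to two trained SMT decoders.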