Arabic preprocessing schemes for statistical machine translation

Authors:
Nizar Habash;Fatiha Sadat
Affiliations:
Columbia University;National Research Council of Canada
Venue:
NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
Year:
2006

Abstract

In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.