Statistical machine translation into a morphologically complex language

Authors:
Kemal Oflazer
Affiliations:
Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Tuzla, Turkey
Venue:
CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing
Year:
2008

Abstract

In this paper, we present the results of our investigation into phrase-based statistical machine translation from English into Turkish, an agglutinative language with very productive inflectional and derivational word-formation processes. We investigate different representational granularities for morphological structure and find that four techniques together provide a nontrivial improvement over a word-based baseline despite our very limited training data: (i) representing both Turkish and English at the morpheme level, with some selective morpheme grouping on the Turkish side of the training data; (ii) augmenting the training data with "sentences" comprising only the content words of the original training data, to bias root-word alignment, and with highly reliable phrase pairs from an earlier corpus alignment; (iii) re-ranking the n-best morpheme-sequence outputs of the decoder with a word-based language model; and (iv) "repairing" translated words with incorrect morphological structure and words that are out-of-vocabulary relative to the training and language model corpora. We improve from 19.77 BLEU points for our word-based baseline model to 26.87 BLEU points, a gain of 7.10 points, or about 36% relative. We briefly discuss the applicability of BLEU to morphologically complex languages like Turkish and present a simple extension, implemented in our BLEU+ tool, that compares tokens not in an all-or-none fashion but by taking lexical-semantic and morpho-semantic similarities into account.
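The re-ranking step (iii) can be illustrated with a small sketch. This is not the paper's implementation; it only shows the general idea of joining each morpheme-sequence hypothesis into full words and rescoring it with a word-level language model. Several things here are assumptions made for illustration: suffix tokens are marked with a leading `+`, morphemes are joined by plain concatenation (a real system would need a morphological generator to produce correct surface forms), the language model is a toy add-one-smoothed bigram model, and the names `join_morphemes`, `WordBigramLM`, `rerank`, and the `lm_weight` parameter are all hypothetical.

```python
import math
from collections import Counter


def join_morphemes(tokens):
    """Glue suffix tokens onto the preceding root.

    Assumption: suffix morphemes carry a leading '+', and surface forms
    can be produced by simple concatenation.
    """
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok[1:]
        else:
            words.append(tok)
    return words


class WordBigramLM:
    """Toy add-one-smoothed bigram model over full word forms."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            padded = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, words):
        """Return the natural-log probability of a word sequence."""
        padded = ["<s>"] + words + ["</s>"]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def rerank(nbest, lm, lm_weight=0.5):
    """Re-rank (morpheme_tokens, decoder_log_score) pairs, best first.

    The combined score is decoder_log_score + lm_weight * word-LM log
    probability; lm_weight is a hypothetical tuning parameter.
    """
    scored = []
    for tokens, decoder_score in nbest:
        words = join_morphemes(tokens)
        combined = decoder_score + lm_weight * lm.logprob(words)
        scored.append((combined, words))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [words for _, words in scored]


# Toy data: the decoder slightly prefers a hypothesis whose joined word
# form was never seen by the word-level language model.
lm = WordBigramLM([["evlerde", "kaldık"]])
nbest = [
    (["ev", "+de", "+ler", "kaldık"], -1.0),
    (["ev", "+ler", "+de", "kaldık"], -1.2),
]
best = rerank(nbest, lm)[0]
print(" ".join(best))
```

In this toy setup the word-level model promotes the second hypothesis, because its joined word form appears in the language model data while the first one does not. With `lm_weight=0.0` the original decoder order is kept.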