Using TectoMT as a preprocessing tool for phrase-based statistical machine translation

Authors:
Daniel Zeman
Affiliations:
Univerzita Karlova v Praze, ÚFAL, Praha, Czechia
Venue:
TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
Year:
2010

Abstract

We present a systematic comparison of preprocessing techniques for two language pairs: English-Czech and English-Hindi. The two target languages, although both belong to the Indo-European language family, differ significantly in morphology, syntax and word order. We describe how TectoMT, a successful framework for analysis and generation of language, can be used as a preprocessor for a phrase-based MT system. We compare the two language pairs and the optimal sets of source-language transformations applied to them. The following transformations are examples of possible preprocessing steps: lemmatization; retokenization and compound splitting; removing or adding words that lack counterparts in the other language; phrase reordering to resemble the target word order; marking syntactic functions. TectoMT, as well as all other tools and data sets we use, is freely available on the Web.
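
To make the listed transformations concrete, here is a minimal Python sketch of a source-side preprocessing pipeline. It is an illustration only and does not use the TectoMT API: the `Token` class, the function-label strings ("Det", "Sb", "Pred", "Obj"), and all function names are hypothetical, and the annotations that a real analyzer would produce are written in by hand. The sketch assumes an English source sentence prepared for a verb-final target such as Hindi.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Token:
    form: str   # surface word form
    lemma: str  # dictionary form (hand-annotated here)
    func: str   # hypothetical syntactic-function label


Step = Callable[[List[Token]], List[Token]]


def drop_articles(tokens: List[Token]) -> List[Token]:
    """Remove words lacking a target-side counterpart (here: articles)."""
    return [t for t in tokens if t.func != "Det"]


def lemmatize(tokens: List[Token]) -> List[Token]:
    """Replace every surface form with its lemma."""
    return [Token(t.lemma, t.lemma, t.func) for t in tokens]


def reorder_verb_final(tokens: List[Token]) -> List[Token]:
    """Move the predicate to the end of the sentence (SVO to SOV)."""
    others = [t for t in tokens if t.func != "Pred"]
    preds = [t for t in tokens if t.func == "Pred"]
    return others + preds


def mark_functions(tokens: List[Token]) -> List[str]:
    """Attach the syntactic-function label to each output word."""
    return [f"{t.form}|{t.func}" for t in tokens]


def preprocess(tokens: List[Token], steps: List[Step]) -> List[Token]:
    """Apply the chosen transformations in order."""
    for step in steps:
        tokens = step(tokens)
    return tokens


sentence = [
    Token("The", "the", "Det"),
    Token("boy", "boy", "Sb"),
    Token("reads", "read", "Pred"),
    Token("books", "book", "Obj"),
]

hindi_like = preprocess(sentence, [drop_articles, lemmatize, reorder_verb_final])
print(" ".join(mark_functions(hindi_like)))  # boy|Sb book|Obj read|Pred
```

Because each transformation is a separate function, a different subset or order of steps can be passed to `preprocess` for each language pair, which mirrors the idea of comparing sets of transformations.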