Statistical machine translation of German compound words

Authors:
Maja Popović;Daniel Stein;Hermann Ney
Affiliations:
Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen, Germany;Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen, Germany;Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen, Germany
Venue:
FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Year:
2006

Abstract

German compound words pose special problems for statistical machine translation systems: the occurrence of each of the components in the training data is not sufficient for successful translation. Even if the compound itself has been seen during training, the system may not be capable of translating it properly into two or more words. If German is the target language, the system might generate only the separated components or may not be capable of choosing the correct compound. In this work, we investigate and compare different strategies for the treatment of German compound words in statistical machine translation systems. For translation from German, we compare linguistic-based and corpus-based compound splitting. For translation into German, we investigate splitting and rejoining German compounds, as well as joining English potential components. Additionally, we investigate word alignments enhanced with knowledge about the splitting points of German compounds. The translation quality is consistently improved by all methods for both translation directions.
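To make the idea of corpus-based compound splitting concrete, the following is a minimal sketch of one well-known frequency-based heuristic: a word is cut into known parts, and the segmentation whose parts have the highest geometric mean of corpus frequencies wins. The abstract does not specify the authors' exact procedure, so this is an illustration under assumptions, not their implementation. The function names, the `min_len` threshold, the linking elements in `fillers`, and the toy counts are all hypothetical.

```python
from math import prod


def segmentations(word, freq, min_len=3, fillers=("", "s", "es")):
    """Yield every way to cut `word` into parts that occur in `freq`.

    `fillers` are linking elements (assumed here: none, "s", "es") that may
    sit between two parts and are dropped from the left-hand part.
    """
    if word in freq:
        yield [word]
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        for filler in fillers:
            if not head.endswith(filler):
                continue
            stem = head[:len(head) - len(filler)]
            if len(stem) >= min_len and stem in freq:
                for tail in segmentations(word[i:], freq, min_len, fillers):
                    yield [stem] + tail


def split_compound(word, freq, min_len=3, fillers=("", "s", "es")):
    """Return the segmentation with the highest geometric-mean frequency.

    The unsplit word competes with its own corpus count, so a frequent
    compound is left intact.
    """
    word = word.lower()
    best, best_score = [word], float(freq.get(word, 0))
    for parts in segmentations(word, freq, min_len, fillers):
        score = prod(freq[p] for p in parts) ** (1 / len(parts))
        if score > best_score:
            best, best_score = parts, score
    return best


# Toy corpus counts, invented purely for illustration.
counts = {"aktionsplan": 50, "aktion": 900, "plan": 700, "akt": 200, "ion": 10}
print(split_compound("Aktionsplan", counts))
```

With these toy counts the rare compound is split at the linking "s", because its two parts are much more frequent than the compound itself. A compound that is frequent in its own right would be kept whole. For translation into German, the inverse step (rejoining split components in the output) would need a separate merging rule that this sketch does not cover.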