Generation of compound words in statistical machine translation into compounding languages

Authors:
Sara Stymne; Nicola Cancedda; Lars Ahrenberg
Affiliations:
Uppsala University; Xerox Research Centre Europe; Linköping University
Venue:
Computational Linguistics
Year:
2013

Abstract

In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, in which compounds are split before the SMT system is trained and merged after the translation step. This approach reduces sparsity in the training data, but it runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, in which compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order, and we show that it can lead to improvements, both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech-based information in the translation process in order to handle compounds.
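The split-merge idea described in the abstract can be illustrated with a minimal sketch. This is not the article's method: the `#` marker, the function names `mark_parts` and `merge_compounds`, and the simple "merge a marked part with the next alphabetic token" rule are all assumptions made for illustration. The merging methods proposed in the article rely on heuristics, machine learning, and part-of-speech information, none of which is modeled here.

```python
from typing import List

# Hypothetical marker appended to every non-final compound part at split time.
MARK = "#"


def mark_parts(parts: List[str]) -> List[str]:
    """Mark all but the last part of an already split compound."""
    return [p + MARK for p in parts[:-1]] + [parts[-1]]


def merge_compounds(tokens: List[str]) -> List[str]:
    """Rebuild compounds in translation output (naive, marker-driven rule).

    A marked token is joined with the following tokens until an unmarked
    alphabetic token closes the compound. If no such token follows, the
    collected parts are emitted as a word of their own, marker removed.
    """
    out: List[str] = []
    buffer = ""  # compound parts collected so far
    for tok in tokens:
        if tok.endswith(MARK) and len(tok) > len(MARK):
            part = tok[: -len(MARK)]
            # Non-initial parts are lowercased inside the compound.
            buffer += part.lower() if buffer else part
        elif buffer and tok.isalpha():
            out.append(buffer + tok.lower())
            buffer = ""
        else:
            if buffer:  # nothing suitable to merge with
                out.append(buffer)
                buffer = ""
            out.append(tok)
    if buffer:
        out.append(buffer)
    return out


if __name__ == "__main__":
    print(mark_parts(["Fuß", "Ball", "Spiel"]))
    print(merge_compounds(["die", "Haus#", "Tür", "ist", "offen", "."]))
```

Because merging depends only on the marker and adjacency, the sketch can produce a compound that never occurred in the training data, which is the property the abstract highlights. It also shows why parts translated into non-consecutive positions are a problem: a marked part followed by punctuation cannot be merged and is left as a separate word.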