Integration of multiple bilingually-learned segmentation schemes into statistical machine translation

Authors:
Michael Paul;Andrew Finch;Eiichiro Sumita
Affiliations:
National Institute of Information and Communications Technology, Keihanna Science City, Kyoto, Japan;National Institute of Information and Communications Technology, Keihanna Science City, Kyoto, Japan;National Institute of Information and Communications Technology, Keihanna Science City, Kyoto, Japan
Venue:
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
Year:
2010

Citing 19
Cited 2

A maximum entropy approach to natural language processing

Computational Linguistics
Inducing Features of Random Fields

IEEE Transactions on Pattern Analysis and Machine Intelligence
A study on word-based and integral-bit Chinese text compression algorithms

Journal of the American Society for Information Science
An Efficient, Probabilistically Sound Algorithm for Segmentation andWord Discovery

Machine Learning - Special issue on natural language learning
A systematic comparison of various statistical alignment models

Computational Linguistics
A statistical model for word discovery in transcribed speech

Computational Linguistics
BLEU: a method for automatic evaluation of machine translation

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Finding translation pairs from English-Japanese untokenized aligned corpora

S2S '02 Proceedings of the ACL-02 workshop on Speech-to-speech translation: algorithms and systems - Volume 7
Contextual dependencies in unsupervised word segmentation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Bayesian semi-supervised Chinese word segmentation for statistical machine translation

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Bilingually motivated domain-adapted word segmentation for statistical machine translation

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Using a maximum entropy model to build segmentation lattices for MT

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions

NAACL-Short '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
(Meta-) evaluation of machine translation

StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Meteor: an automatic metric for MT evaluation with high levels of correlation with human judgments

StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Improved statistical machine translation by multiple Chinese word segmentation

StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Optimizing Chinese word segmentation for machine translation performance

StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Unsupervised tokenization for machine translation

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Comparative study on corpora for speech translation

IEEE Transactions on Audio, Speech, and Language Processing

Enhancing statistical machine translation with character alignment

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
An empirical study on word segmentation for chinese machine translation

CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text in order to improve the translation quality of statistical machine translation (SMT) approaches. The method can be applied to any language pair where the source language is unseg-mented and the target language segmentation is known. First, an iterative bootstrap method is applied to learn multiple segmentation schemes that are consistent with the phrasal segmentations of an SMT system trained on the resegmented bitext. In the second step, multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of differently segmented SMT models. Experimental results translating five Asian languages into English revealed that the method of integrating multiple segmentation schemes outperforms SMT models trained on any of the learned word segmentations and performs comparably to available state-of-the-art monolingually-built segmentation tools.