Language independent word segmentation for statistical machine translation

Authors:
Michael Paul;Andrew Finch;Eiichiro Sumita
Affiliations:
National Institute of Information and Communications Technology (NICT), Kyoto, Japan;National Institute of Information and Communications Technology (NICT), Kyoto, Japan;National Institute of Information and Communications Technology (NICT), Kyoto, Japan
Venue:
Proceedings of the 3rd International Universal Communication Symposium
Year:
2009

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous text so as to optimize the translation quality of statistical machine translation (SMT) systems. The proposed method is language-independent: it uses a parallel corpus to align source-language characters to the corresponding whitespace-separated word units in the target language. Successive characters aligned to the same target word are merged into a larger source-language unit, and a Maximum Entropy (ME) algorithm is applied to learn the word segmentation that optimizes the translation quality of an SMT system trained on the re-segmented bitext. Experiments translating five Asian languages into English show that the proposed method outperforms a baseline system that translates unigram-segmented source-language sentences.
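
The merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `merge_aligned_characters`, the representation of the alignment as one target-word index per source character, and the choice to keep unaligned characters as single-character units are all assumptions made for illustration. The character-to-word alignment itself, the ME training, and the SMT training are not shown.

```python
from typing import List, Optional


def merge_aligned_characters(
    chars: List[str], alignment: List[Optional[int]]
) -> List[str]:
    """Merge runs of successive characters aligned to the same target word.

    alignment[i] is the (hypothetical) index of the target word that
    chars[i] is aligned to, or None if the character is unaligned.
    Assumption: an unaligned character always stays a unit of its own.
    """
    if len(chars) != len(alignment):
        raise ValueError("chars and alignment must have the same length")

    units: List[str] = []
    prev: Optional[int] = None
    for ch, tgt in zip(chars, alignment):
        if units and tgt is not None and tgt == prev:
            # Same target word as the previous character: extend the unit.
            units[-1] += ch
        else:
            # Different target word (or unaligned): start a new unit.
            units.append(ch)
        prev = tgt
    return units


if __name__ == "__main__":
    # Hand-made toy alignment (an assumption, not data from the paper):
    # source characters of "東京に行く" against target "go to Tokyo",
    # with target word indices go=0, to=1, Tokyo=2.
    source = list("東京に行く")
    toy_alignment = [2, 2, 1, 0, 0]
    print(merge_aligned_characters(source, toy_alignment))
```

In this sketch only adjacent characters are merged, so two characters aligned to the same target word but separated by a differently aligned character remain separate units, which matches the "successive characters" wording of the abstract.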