A maximum entropy approach to natural language processing
Computational Linguistics
Inducing Features of Random Fields
IEEE Transactions on Pattern Analysis and Machine Intelligence
A study on word-based and integral-bit Chinese text compression algorithms
Journal of the American Society for Information Science
An Efficient, Probabilistically Sound Algorithm for Segmentation andWord Discovery
Machine Learning - Special issue on natural language learning
A systematic comparison of various statistical alignment models
Computational Linguistics
A statistical model for word discovery in transcribed speech
Computational Linguistics
BLEU: a method for automatic evaluation of machine translation
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Contextual dependencies in unsupervised word segmentation
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Bayesian semi-supervised Chinese word segmentation for statistical machine translation
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Multilingual mobile-phone translation services for world travelers
COLING '08 22nd International Conference on on Computational Linguistics: Demonstration Papers
Optimizing Chinese word segmentation for machine translation performance
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Hi-index | 0.00 |
This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous text in order to optimize the translation quality of statistical machine translation (SMT) approaches. The proposed method is language-independent and uses a parallel corpus to align source language characters to the corresponding word units separated by whitespace in the target language. Successive characters aligned to the same target words are merged to a larger source language unit and a Maximum Entropy (ME) algorithm is applied to learn the word segmentation that optimizes the translation quality of an SMT system trained on the re-segmented bitext. Experimental results translating five Asian languages into English revealed that the proposed method outperforms a baseline system that translates unigram segmented source language sentences.