Train the machine with what it can learn: corpus selection for SMT

Authors:
Xiwu Han; Hanzhang Li; Tiejun Zhao
Affiliations:
Heilongjiang University, Harbin City, China; Heilongjiang University, Harbin City, China; Harbin Institute of Technology, Harbin City, China
Venue:
BUCC '09 Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora
Year:
2009

Abstract

Statistical machine translation (SMT) relies heavily on available parallel corpora, but an SMT system may not be able to make full use of its training set. Rather than collecting ever more parallel training data, this paper aims to improve SMT performance by exploiting the full potential of existing parallel corpora. We first identify literally translated sentence pairs using lexical and grammatical compatibility, and then use these pairs to train SMT models. One experiment indicates that a larger training corpus does not always yield better decoding performance when the added data are not literal translations. A second experiment shows that appropriately increasing the contribution of literal translations can improve SMT performance significantly.
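
As a rough illustration of the selection idea in the abstract, the sketch below scores each sentence pair by how many source words have a dictionary translation in the target sentence, keeps pairs above a threshold, and then repeats the kept pairs to raise their weight in training. This is not the authors' implementation: the abstract does not say how compatibility is computed or how the contribution of literal translations is enlarged. The coverage score, the 0.6 threshold, the repetition scheme, the toy English-French lexicon, and all function names are assumptions made for illustration, and grammatical compatibility is left out entirely.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]

# Hypothetical toy bilingual lexicon (English -> French), for illustration only.
LEXICON: Dict[str, Set[str]] = {
    "the": {"le", "la"},
    "cat": {"chat"},
    "sleeps": {"dort"},
    "it": {"il"},
    "rains": {"pleut"},
}

# Hypothetical toy corpus: one literal pair and one idiomatic (non-literal) pair.
PAIRS: List[Pair] = [
    ("the cat sleeps", "le chat dort"),
    ("it rains cats and dogs", "il pleut des cordes"),
]


def lexical_compatibility(src_tokens: List[str],
                          tgt_tokens: List[str],
                          lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with a lexicon translation in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    covered = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_set)
    return covered / len(src_tokens)


def select_literal_pairs(pairs: List[Pair],
                         lexicon: Dict[str, Set[str]],
                         threshold: float = 0.6) -> List[Pair]:
    """Keep pairs whose lexical compatibility reaches the (assumed) threshold."""
    selected = []
    for src, tgt in pairs:
        score = lexical_compatibility(src.lower().split(),
                                      tgt.lower().split(),
                                      lexicon)
        if score >= threshold:
            selected.append((src, tgt))
    return selected


def upweight(corpus: List[Pair], literal: List[Pair], factor: int = 2) -> List[Pair]:
    """Return the corpus with each literal pair appearing `factor` times in total.

    Assumes every pair in `literal` already occurs once in `corpus`.
    """
    return corpus + literal * (factor - 1)


if __name__ == "__main__":
    literal = select_literal_pairs(PAIRS, LEXICON)
    training_set = upweight(PAIRS, literal, factor=3)
    print(len(literal), "literal pair(s);", len(training_set), "training pairs")
```

Repeating selected pairs is only one simple way to enlarge their contribution; per-sentence weights in the alignment or phrase-extraction step would serve the same purpose where the toolkit supports them.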