Toward a unified approach to statistical language modeling for Chinese
ACM Transactions on Asian Language Information Processing (TALIP)
A systematic comparison of various statistical alignment models
Computational Linguistics
Minimum error rate training in statistical machine translation
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Moses: open source toolkit for statistical machine translation
ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
CCG supertags in factored statistical machine translation
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Mixture-model adaptation for SMT
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Experiments in domain adaptation for statistical machine translation
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Discriminative corpus weight estimation for machine translation
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
Intelligent selection of language model training data
ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Discriminative instance weighting for domain adaptation in statistical machine translation
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Does more data always yield better translations?
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Adapting translation models to translationese improves SMT
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Perplexity minimization for translation model domain adaptation in statistical machine translation
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Maximum expected BLEU training of phrase and lexicon translation models
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Topic models for dynamic translation model adaptation
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Translation model based cross-lingual language model adaptation: from word models to phrase models
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Towards effective use of training data in statistical machine translation
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Selecting data for English-to-Czech machine translation
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Analysing the effect of out-of-domain data on SMT systems
WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
Joint and coupled bilingual topic model based sentence representations for language model adaptation
IJCAI '13 Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence
Instance selection and instance weighting for cross-domain sentiment classification via PU learning
IJCAI '13 Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence
Improving statistical machine translation by adapting translation models to translationese
Computational Linguistics
We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora -- 1% the size of the original -- can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that the best results are attained via proper domain-relevant data selection, as well as by combining in-domain and general-domain systems during decoding.
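The abstract's selection criterion can be illustrated with a minimal sketch of cross-entropy difference scoring: train one language model on the in-domain data and one on the general-domain data, score each general-domain sentence by the difference of its per-word cross-entropies under the two models, and keep the lowest-scoring fraction as the pseudo in-domain subcorpus. The unigram models, add-one smoothing, and all function names below are illustrative simplifications, not the paper's exact setup.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Train an add-one-smoothed unigram LM; returns a log-probability function."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot of mass for unseen words
    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))
    return logprob

def cross_entropy(sentence, logprob):
    """Per-word cross-entropy of a sentence under a unigram LM."""
    words = sentence.split()
    return -sum(logprob(w) for w in words) / max(len(words), 1)

def select_pseudo_in_domain(general, in_domain, fraction=0.01):
    """Rank general-domain sentences by H_in(s) - H_gen(s); lower means
    more in-domain-like. Keep the best `fraction` of the ranking."""
    lp_in = train_unigram(in_domain)
    lp_gen = train_unigram(general)
    scored = sorted(
        general,
        key=lambda s: cross_entropy(s, lp_in) - cross_entropy(s, lp_gen),
    )
    k = max(1, int(len(scored) * fraction))
    return scored[:k]
```

Subtracting the general-domain cross-entropy rewards sentences that look in-domain specifically, rather than sentences that are merely short or made of frequent words.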