Large amounts of data are now available for training statistical machine translation systems, but it is not clear whether all of the training data actually help. A system trained on a subset of such huge bilingual corpora might outperform one trained on all the bilingual data. This paper studies these issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results show significant improvements over random sentence selection, and also an improvement over a system trained on all the available data. Surprisingly, the improvements are obtained with a small fraction of the data, accounting for less than 0.5% of the sentences. Finally, we show that much more room for improvement exists, although this is demonstrated under non-realistic conditions.
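
The abstract names selection by infrequent n-gram occurrence but does not give its details here. As a rough illustration only, one plausible reading is a greedy procedure that keeps adding pool sentences while they still cover n-grams from the in-domain text that have been seen fewer than some threshold number of times. The function names, the `threshold` and `max_n` parameters, and the scoring rule below are assumptions made for this sketch, not the paper's actual algorithm.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every n-gram (as a tuple) of order 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def select_infrequent(pool, in_domain, threshold=2, max_n=3):
    """Greedily select pool sentences that cover under-represented n-grams.

    Hypothetical scoring rule: a sentence scores one point for each distinct
    n-gram that occurs in the in-domain text and has so far been covered
    fewer than `threshold` times by the selected sentences.
    """
    needed = {g for sent in in_domain for g in ngrams(sent.split(), max_n)}
    counts = Counter()  # coverage of each n-gram by the selected sentences

    def score(sentence):
        grams = set(ngrams(sentence.split(), max_n))
        return sum(1 for g in grams if g in needed and counts[g] < threshold)

    remaining = list(pool)
    selected = []
    while remaining:
        best = max(remaining, key=score)
        if score(best) == 0:
            break  # no remaining sentence adds an infrequent n-gram
        selected.append(best)
        remaining.remove(best)
        counts.update(set(ngrams(best.split(), max_n)))
    return selected


pool = [
    "the cat sat",
    "stock markets fell sharply",
    "markets fell",
    "a dog barked",
]
in_domain = ["stock markets fell today"]
print(select_infrequent(pool, in_domain, threshold=1, max_n=2))
```

Rescoring the whole pool on every iteration is quadratic in the pool size, which is acceptable for a toy example but would need an inverted index from n-grams to sentences for corpora of realistic size.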