Prediction of learning curves in machine translation

Authors:
Prasanth Kolachina;Nicola Cancedda;Marc Dymetman;Sriram Venkatapathy
Affiliations:
LTRC, IIIT-Hyderabad, Hyderabad, India;Xerox Research Centre Europe, Meylan, France;Xerox Research Centre Europe, Meylan, France;Xerox Research Centre Europe, Meylan, France
Venue:
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Year:
2012

Citing 10
Cited 0

Modelling Classification Performance for Large Data Sets
WAIM '01 Proceedings of the Second International Conference on Advances in Web-Age Information Management

Tree induction vs. logistic regression: a learning-curve analysis
The Journal of Machine Learning Research

BLEU: a method for automatic evaluation of machine translation
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics

Statistical phrase-based translation
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Moses: open source toolkit for statistical machine translation
ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions

Predicting success in machine translation
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Learning performance of a machine translation system: a statistical and computational analysis
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation

Stabilizing minimum error rate training
StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation

Better hypothesis testing for statistical machine translation: controlling for optimizer instability
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2

Findings of the 2011 Workshop on Statistical Machine Translation
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation


Abstract

Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose. Since ad-hoc manual translation can represent a significant investment in time and money, a prior assessment of the amount of training data required to achieve a satisfactory accuracy level can be very useful. In this work, we show how to predict what the learning curve would look like if we were to manually translate increasing amounts of data. We consider two scenarios: 1) monolingual samples in the source and target languages are available, and 2) a small parallel corpus is available in addition. We propose methods for predicting learning curves in both scenarios.
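
To make the second scenario more concrete, the following is a minimal sketch of extrapolating a learning curve from a few points measured on a small parallel corpus. The abstract does not specify the curve family or the fitting procedure, so everything here is an assumption for illustration: the power-law form y = c - a * x^(-alpha), the helper names `fit_power_law` and `predict`, and the data points, which are synthetic values generated from a known curve and not results from the paper.

```python
def predict(params, size):
    """Evaluate the assumed curve y = c - a * size**(-alpha)."""
    c, a, alpha = params
    return c - a * size ** (-alpha)


def fit_power_law(sizes, scores, alphas=None):
    """Fit y = c - a * x**(-alpha) to (size, score) points.

    Hypothetical helper: grid search over alpha, with a closed-form
    least-squares solution for c and a at each candidate alpha.
    """
    if alphas is None:
        alphas = [i / 100 for i in range(1, 201)]  # candidates 0.01 .. 2.00
    n = len(sizes)
    mean_y = sum(scores) / n
    best = None
    for alpha in alphas:
        # With z = x**(-alpha), the model is linear: y = c + b * z, b = -a.
        z = [x ** (-alpha) for x in sizes]
        mean_z = sum(z) / n
        var_z = sum((zi - mean_z) ** 2 for zi in z)
        if var_z == 0:
            continue
        b = sum((zi - mean_z) * (yi - mean_y)
                for zi, yi in zip(z, scores)) / var_z
        c = mean_y - b * mean_z
        sse = sum((c + b * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, c, -b, alpha)
    _, c, a, alpha = best
    return c, a, alpha


# Synthetic illustration only: points generated from a known curve.
true_params = (0.35, 1.2, 0.3)
sizes = [1000, 2000, 5000, 10000]          # sentence pairs (made up)
scores = [predict(true_params, s) for s in sizes]

params = fit_power_law(sizes, scores)
extrapolated = predict(params, 100000)     # score predicted at 100k pairs
```

In this sketch the fitted curve is then evaluated at a corpus size larger than any observed point, which is the kind of estimate that would inform how much data to translate manually. With real, noisy measurements the fit would be less exact than with these synthetic points.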