Discriminative sample selection for statistical machine translation

Authors:
Sankaranarayanan Ananthakrishnan;Rohit Prasad;David Stallard;Prem Natarajan
Affiliations:
Raytheon BBN Technologies, Cambridge, MA;Raytheon BBN Technologies, Cambridge, MA;Raytheon BBN Technologies, Cambridge, MA;Raytheon BBN Technologies, Cambridge, MA
Venue:
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Year:
2010

Abstract

Producing parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense of producing such resources, seeking to attain a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel discriminative sample selection strategy that preferentially selects batches of candidate sentences containing constructs that lead to erroneous translations on a held-out development set. The proposed strategy includes a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks show that the proposed approach outperforms a number of competing techniques, including random selection, dissimilarity-based selection, and a recently proposed semi-supervised active learning strategy.
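
The abstract does not spell out the selection procedure, so the following is only an illustrative sketch of the general idea (error-driven batch selection with a diversity mechanism), not the authors' algorithm. It assumes each source n-gram already carries a hypothetical `error_weight` reflecting how strongly it is associated with translation errors on the development set; how such weights would be learned is outside this sketch. The names `ngrams`, `select_batch`, and `decay` are all invented for illustration.

```python
def ngrams(tokens, n_max=2):
    """Return all n-grams (as tuples) of length 1..n_max."""
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def select_batch(pool, error_weight, batch_size, decay=0.5):
    """Greedily pick sentences whose n-grams carry high error weight.

    pool:         list of candidate source sentences (strings).
    error_weight: dict mapping n-gram tuples to a hypothetical score for
                  how strongly that n-gram is tied to dev-set errors.
    decay:        assumed diversity knob; after a sentence is chosen, the
                  weight of each of its n-grams is multiplied by this
                  factor so later picks favor constructs not yet covered.
    """
    weights = dict(error_weight)  # work on a copy; caller's dict is untouched
    remaining = list(pool)
    batch = []

    def score(sentence):
        # Sum the current weights of the distinct n-grams in the sentence.
        grams = set(ngrams(sentence.split()))
        return sum(weights.get(g, 0.0) for g in grams)

    while remaining and len(batch) < batch_size:
        best = max(remaining, key=score)  # ties go to the earliest candidate
        batch.append(best)
        remaining.remove(best)
        # Diversity step: discount n-grams already covered by the batch.
        for g in set(ngrams(best.split())):
            if g in weights:
                weights[g] *= decay
    return batch


pool = ["the red car", "the red car stopped", "a blue boat"]
error_weight = {("red",): 1.0, ("car",): 1.0, ("blue",): 1.5}
print(select_batch(pool, error_weight, batch_size=2))
```

With `decay=1.0` the discount is disabled and the second pick in this toy pool is the near-duplicate "the red car stopped"; with the default discount the second pick shifts to "a blue boat", which is the kind of redundancy reduction the abstract attributes to its diversity mechanism.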