A semi-supervised batch-mode active learning strategy for improved statistical machine translation

  • Authors:
  • Sankaranarayanan Ananthakrishnan;Rohit Prasad;David Stallard;Prem Natarajan

  • Affiliations:
  • BBN Technologies, Cambridge, MA;BBN Technologies, Cambridge, MA;BBN Technologies, Cambridge, MA;BBN Technologies, Cambridge, MA

  • Venue:
  • CoNLL '10 Proceedings of the Fourteenth Conference on Computational Natural Language Learning
  • Year:
  • 2010

Abstract

The availability of substantial, in-domain parallel corpora is critical for the development of high-performance statistical machine translation (SMT) systems. Such corpora, however, are expensive to produce due to the labor-intensive nature of manual translation. We propose to alleviate this problem with a novel, semi-supervised, batch-mode active learning strategy that attempts to maximize in-domain coverage by selecting sentences that represent a balance between domain match, translation difficulty, and batch diversity. Simulation experiments on an English-to-Pashto translation task show that the proposed strategy outperforms not only the random selection baseline but also traditional active learning techniques based on dissimilarity to existing training data. Our approach achieves a relative improvement of 45.9% in BLEU over the seed baseline, while the closest competitor gained only 24.8% with the same number of selected sentences.
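
The abstract names three selection criteria (domain match, translation difficulty, batch diversity) but does not say how they are scored or combined. The following is therefore only an illustrative sketch of greedy batch-mode selection, not the authors' method. Every proxy in it is an assumption: domain match is approximated by how often an n-gram occurs across the candidate pool, translation difficulty by whether the n-gram is absent from the seed training data, and batch diversity by ignoring n-grams already covered by earlier picks in the same batch. The names `select_batch`, `ngrams`, `pool`, and `seed_source` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], max_n: int = 2) -> Set[Ngram]:
    """Return the set of all n-grams of order 1..max_n in a token sequence."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_batch(
    pool: List[Sequence[str]],
    seed_source: List[Sequence[str]],
    batch_size: int,
    max_n: int = 2,
) -> List[int]:
    """Greedily pick up to batch_size sentence indices from pool.

    Assumed scoring (not taken from the paper): a sentence's score is the
    pool-frequency mass of its n-grams that are not yet covered, divided by
    its total number of n-grams.
    """
    pool_ngrams = [ngrams(sentence, max_n) for sentence in pool]

    # Domain-match proxy: number of pool sentences containing each n-gram.
    freq = Counter(g for grams in pool_ngrams for g in grams)

    # Difficulty proxy: n-grams already in the seed data count as covered.
    covered: Set[Ngram] = set()
    for sentence in seed_source:
        covered |= ngrams(sentence, max_n)

    def score(i: int) -> float:
        grams = pool_ngrams[i]
        if not grams:
            return 0.0
        return sum(freq[g] for g in grams - covered) / len(grams)

    chosen: List[int] = []
    remaining = set(range(len(pool)))
    while remaining and len(chosen) < batch_size:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        if score(best) == 0.0:
            break  # nothing left that adds uncovered n-grams
        chosen.append(best)
        remaining.remove(best)
        # Diversity proxy: later picks get no credit for these n-grams.
        covered |= pool_ngrams[best]
    return chosen


if __name__ == "__main__":
    pool = [
        "open the gate".split(),
        "open the gate".split(),
        "show your papers".split(),
        "the".split(),
    ]
    seed = ["the".split()]
    print(select_batch(pool, seed, batch_size=2, max_n=1))  # [0, 2]
```

In this toy run the duplicate sentence at index 1 is skipped because its n-grams are already covered after index 0 is chosen, which is the batch-diversity effect the abstract describes. A real system would likely need a learned or model-based difficulty estimate rather than the simple unseen-n-gram test assumed here.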