Does more data always yield better translations?

  • Authors:
  • Guillem Gascó, Martha-Alicia Rocha, Germán Sanchis-Trilles, Jesús Andrés-Ferrer, Francisco Casacuberta

  • Affiliations:
  • Universitat Politècnica de València, València, Spain (all authors)

  • Venue:
  • EACL '12: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
  • Year:
  • 2012


Abstract

Nowadays, large amounts of data are available to train statistical machine translation systems. However, it is not clear whether all of these training data actually help. A system trained on a subset of such huge bilingual corpora might outperform a system trained on all of the bilingual data. This paper studies these issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on the occurrence of infrequent n-grams. Experimental results report not only significant improvements over random sentence selection but also an improvement over a system trained on all the available data. Surprisingly, these improvements are obtained with just a small fraction of the data, accounting for less than 0.5% of the sentences. Afterwards, we show that much larger room for improvement exists, although under non-realistic conditions.
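
To make the two selection criteria concrete, the sketch below is a minimal, hypothetical illustration, not the authors' implementation: the first criterion ranks candidate sentences by their probability under a simple smoothed unigram model of the in-domain corpus; the second greedily selects sentences that cover in-domain n-grams that are still infrequent in the subset selected so far. The function names, the unigram model, and the n and threshold parameters are all assumptions made for illustration.

```python
# Hypothetical sketch of the two data selection criteria; simplified,
# not the paper's actual method.
from collections import Counter
import math

def unigram_model(sentences):
    """Train an add-one-smoothed unigram model from whitespace-tokenized sentences."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    return counts, total, len(counts) + 1  # +1 vocabulary slot for unseen tokens

def per_token_logprob(sentence, model):
    """Average log-probability per token under the unigram model."""
    counts, total, vocab = model
    toks = sentence.split()
    lp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
    return lp / max(len(toks), 1)

def select_by_in_domain_probability(pool, in_domain, k):
    """Criterion 1 (sketch): keep the k pool sentences that are most
    probable under a model of the in-domain corpus."""
    model = unigram_model(in_domain)
    return sorted(pool, key=lambda s: per_token_logprob(s, model),
                  reverse=True)[:k]

def select_by_infrequent_ngrams(pool, in_domain, k, n=2, threshold=2):
    """Criterion 2 (sketch): greedily pick sentences covering in-domain
    n-grams still seen fewer than `threshold` times in the selection."""
    def ngrams(s):
        toks = s.split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    needed = Counter(g for s in in_domain for g in ngrams(s))
    seen = Counter()
    selected = []
    candidates = set(range(len(pool)))
    for _ in range(min(k, len(pool))):
        # Score = number of in-domain n-grams this sentence contributes
        # while they are still infrequent in the current selection.
        best = max(candidates,
                   key=lambda i: sum(1 for g in ngrams(pool[i])
                                     if g in needed and seen[g] < threshold))
        candidates.remove(best)
        selected.append(pool[best])
        seen.update(g for g in ngrams(pool[best]) if g in needed)
    return selected

if __name__ == "__main__":
    in_domain = ["the patient received a dose", "a dose was administered"]
    pool = ["the patient received treatment", "stock prices fell sharply",
            "a dose was given to the patient", "the weather was mild"]
    print(select_by_in_domain_probability(pool, in_domain, k=2))
    print(select_by_infrequent_ngrams(pool, in_domain, k=2))
```

In both cases the selection score depends only on the in-domain corpus and the candidate sentence (plus, for the second criterion, what has already been selected), which is what allows a tiny, well-chosen subset of a huge pool to compete with the full training set.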