Selecting data for English-to-Czech machine translation

  • Authors:
  • Aleš Tamchyna;Petra Galuščáková;Amir Kamran;Miloš Stanojević;Ondřej Bojar

  • Affiliations:
  • Charles University in Prague, Praha, CZ, Czech Republic;Charles University in Prague, Praha, CZ, Czech Republic;Charles University in Prague, Praha, CZ, Czech Republic;Charles University in Prague, Praha, CZ, Czech Republic;Charles University in Prague, Praha, CZ, Czech Republic

  • Venue:
  • WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation
  • Year:
  • 2012

Abstract

We provide a few insights on data selection for machine translation. We evaluate the quality of the new CzEng 1.0, a parallel data source used in WMT12. We describe a simple technique for reducing the out-of-vocabulary rate after phrase extraction. We discuss the benefits of tuning towards multiple reference translations for the English-Czech language pair. We introduce a novel approach to data selection based on full-text indexing and search: we select sentences similar to the test set from a large monolingual corpus and explore several options for incorporating them into a machine translation system. We show that this method can improve translation quality. Finally, we describe our submitted system, CU-TAMCH-BOJ.
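
The idea of selecting test-set-like sentences through indexing and search can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the search engine or the ranking function, so the in-memory inverted index, the whitespace tokenizer, the IDF-sum scoring, and the names `tokenize`, `build_index`, `select_similar` and `top_k` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    # Naive lowercase whitespace tokenization (an assumption of this sketch).
    return sentence.lower().split()


def build_index(corpus):
    # Inverted index: token -> set of ids of corpus sentences containing it.
    index = defaultdict(set)
    for sent_id, sentence in enumerate(corpus):
        for token in set(tokenize(sentence)):
            index[token].add(sent_id)
    return index


def select_similar(test_sentences, corpus, top_k=1):
    # For each test sentence, retrieve the top_k corpus sentences ranked by
    # the summed IDF weight of the tokens they share with that test sentence.
    # The ranking function is a stand-in, not the one used in the paper.
    index = build_index(corpus)
    n = len(corpus)
    selected = set()
    for query in test_sentences:
        scores = Counter()
        for token in set(tokenize(query)):
            postings = index.get(token, ())
            if not postings:
                continue  # token never occurs in the corpus
            idf = math.log(1 + n / len(postings))
            for sent_id in postings:
                scores[sent_id] += idf
        # Highest score first; ties are broken by corpus order.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.update(sent_id for sent_id, _ in ranked[:top_k])
    # Return the selected sentences in their original corpus order.
    return [corpus[i] for i in sorted(selected)]
```

In such a setup, the returned sentences would form a small in-domain subset of the monolingual corpus, which could then be used, for example, to train an additional language model. How the selected data is best incorporated is one of the questions the paper explores.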