Building a web-based parallel corpus and filtering out machine-translated text

  • Authors:
  • Alexandra Antonova; Alexey Misyurev

  • Affiliations:
  • Yandex, Moscow, Russia; Yandex, Moscow, Russia

  • Venue:
  • BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
  • Year:
  • 2011

Abstract

We describe a set of techniques developed while collecting parallel texts for the Russian-English language pair and building a corpus of parallel sentences for training a statistical machine translation system. We discuss the issues of verifying potential parallel texts and filtering out automatically translated documents. Finally, we evaluate the quality of the resulting 1-million-sentence corpus, which we believe may be a useful resource for machine translation research.