Building a statistical machine translation system from scratch: how much bang for the buck can we expect?

  • Authors: Ulrich Germann
  • Affiliations: USC Information Sciences Institute, Marina del Rey, CA
  • Venue: DMMT '01 Proceedings of the workshop on Data-driven methods in machine translation - Volume 14
  • Year: 2001

Abstract

We report on our experience with building a statistical MT system from scratch, including the creation of a small parallel Tamil-English corpus, and on the results of a task-based pilot evaluation of statistical MT systems trained on sets of approximately 1,300 and 5,000 parallel Tamil-English sentences. Our results show that, even when the system output appears incomprehensible, humans without any knowledge of Tamil can achieve up to 86% accuracy on topic identification, 93% recall on document retrieval, and 64% recall on question answering (with a further 14% of answers partially correct).