Building a statistical machine translation system from scratch: how much bang for the buck can we expect?

Authors:
Ulrich Germann
Affiliations:
USC Information Sciences Institute, Marina del Rey, CA
Venue:
DMMT '01 Proceedings of the workshop on Data-driven methods in machine translation - Volume 14
Year:
2001

Citing 2
Cited 10

The mathematics of statistical machine translation: parameter estimation

Computational Linguistics - Special issue on using large corpora: II
Fast decoding and optimal decoding for machine translation

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics

Translation by the Numbers: Language Weaver

AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
Bootstrapping the Lexicon Building Process for Machine Translation between `New' Languages

AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
Classification Approach to Word Selection in Machine Translation

AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
Cross-lingual C*ST*RD: English access to Hindi information

ACM Transactions on Asian Language Information Processing (TALIP)
Statistical machine translation with word- and sentence-aligned parallel corpora

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Crowdsourcing translation: professional quality from non-professionals

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Parallel sentence generation from comparable corpora for improved SMT

Machine Translation
An iterative stemmer for tamil language

ACIIDS'12 Proceedings of the 4th Asian conference on Intelligent Information and Database Systems - Volume Part III
Toward statistical machine translation without parallel corpora

EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Constructing parallel corpora for six Indian languages via crowdsourcing

WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation

Quantified Score

Hi-index	0.00

Visualization

Abstract

We report on our experience with building a statistical MT system from scratch, including the creation of a small parallel Tamil-English corpus, and the results of a task-based pilot evaluation of statistical MT systems trained on sets of ca. 1300 and ca. 5000 parallel sentences of Tamil and English data. Our results show that even with apparently incomprehensible system output, humans without any knowledge of Tamil can achieve performance rates as high as 86% accuracy for topic identification, 93% recall for document retrieval, and 64% recall on question answering (plus an additional 14% partially correct answers).