The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program ( http://1.usa.gov/transtac ) faced many challenges in applying automated measures of translation quality to Iraqi Arabic-English speech translation dialogues. Features of speech data in general, and of Iraqi Arabic data in particular, undermine basic assumptions of automated measures that depend on matching system outputs to reference translations. We describe these features and the challenges they present for evaluating machine translation quality with automated metrics. We show that scores for translation into Iraqi Arabic correlate better with human judgments when they are computed from normalized system outputs and reference translations: orthographic normalization, lexical normalization, and operations involving light stemming all raised correlations with human judgments.
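To make the normalization idea concrete, the sketch below shows one way such preprocessing could look in Python. It is illustrative only, not the TRANSTAC pipeline: the orthographic mappings (collapsing alef/hamza variants, taa marbuta, and alef maqsura, stripping diacritics and tatweel) follow common Arabic NLP conventions, and the prefix/suffix lists for light stemming are a small hypothetical selection chosen for the example.

```python
# -*- coding: utf-8 -*-
"""Illustrative normalization of Arabic text before automated MT scoring.
A minimal sketch under assumed conventions; the affix lists below are
hypothetical and deliberately conservative."""
import re

# Orthographic normalization: map alef/hamza variants to bare alef,
# taa marbuta to haa, alef maqsura to yaa.
ORTHO_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # taa marbuta           -> haa
    "\u0649": "\u064A",  # alef maqsura          -> yaa
})
# Short vowel diacritics (harakat) and tatweel, both removed.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

# Light stemming: strip a few frequent clitic prefixes and suffixes
# (hypothetical lists: wal-, al-, wa- / -ha, -h, -in, -at).
PREFIXES = ("\u0648\u0627\u0644", "\u0627\u0644", "\u0648")
SUFFIXES = ("\u0647\u0627", "\u0647", "\u064A\u0646", "\u0627\u062A")

def light_stem(token: str) -> str:
    # Remove at most one prefix and one suffix, keeping a stem of >= 2 chars.
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 2:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 2:
            token = token[: -len(s)]
            break
    return token

def normalize(line: str) -> str:
    # Orthographic and lexical normalization, then light stemming per token.
    line = DIACRITICS.sub("", line).translate(ORTHO_MAP)
    return " ".join(light_stem(tok) for tok in line.split())
```

In a setup like this, both the system output and every reference translation would be passed through normalize() before computing BLEU or any other metric that matches outputs against references, so that superficial orthographic and morphological variation does not register as translation error.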