METEOR, M-BLEU and M-TER: evaluation metrics for high-correlation with human rankings of machine translation output

Authors:
Abhaya Agarwal;Alon Lavie
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA
Venue:
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Year:
2008


Abstract

This paper describes our submissions to the machine translation evaluation shared task at ACL WMT-08. Our primary submission is the Meteor metric, tuned to optimize correlation with human rankings of translation hypotheses. We show a significant improvement in correlation compared to the earlier version of the metric, which was tuned to optimize correlation with traditional adequacy and fluency judgments. We also describe M-BLEU and M-TER, enhanced versions of two other widely used metrics, BLEU and TER respectively, which extend the exact word matching used in these metrics with the flexible matching based on stemming and WordNet in Meteor.
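
To illustrate the difference between exact and flexible word matching mentioned in the abstract, here is a minimal Python sketch. It is not the authors' implementation: `toy_stem` and `TOY_SYNONYMS` are hypothetical stand-ins for a real stemmer and for WordNet synonym lookup, and the staged greedy matching (exact, then stem, then synonym) is an assumption made for illustration only.

```python
# Hypothetical stand-in for a real stemmer: strips a few common English suffixes.
def toy_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical stand-in for WordNet synonym lookup.
TOY_SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
}


def count_matches(hyp, ref, flexible=True):
    """Count one-to-one word matches between hypothesis and reference.

    With flexible=False only identical words match. With flexible=True,
    words left unmatched are then compared by stem, then by synonym.
    """
    stages = [lambda h, r: h == r]
    if flexible:
        stages.append(lambda h, r: toy_stem(h) == toy_stem(r))
        stages.append(lambda h, r: r in TOY_SYNONYMS.get(h, set()))

    hyp_free = list(range(len(hyp)))  # indices of unmatched hypothesis words
    ref_free = list(range(len(ref)))  # indices of unmatched reference words
    matched = 0
    for is_match in stages:
        for i in list(hyp_free):
            for j in ref_free:
                if is_match(hyp[i], ref[j]):
                    hyp_free.remove(i)
                    ref_free.remove(j)
                    matched += 1
                    break  # each word is matched at most once
    return matched


hyp = "he walked to the car".split()
ref = "he walks to the automobile".split()

exact = count_matches(hyp, ref, flexible=False)
flex = count_matches(hyp, ref, flexible=True)
print(f"exact unigram precision:    {exact / len(hyp):.2f}")
print(f"flexible unigram precision: {flex / len(hyp):.2f}")
```

In this toy example, exact matching credits only "he", "to" and "the", while flexible matching also credits "walked"/"walks" (same stem) and "car"/"automobile" (synonyms). A metric built on exact matching would therefore penalize a hypothesis that differs from the reference only in inflection or word choice.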