Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR, and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance-test-driven comparison of n-gram-based automatic MT evaluation metrics. We use statistical significance tests based on bootstrap resampling to estimate the reliability of automatic MT evaluation scores. Based on these reliability estimates, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.
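To make the bootstrapping step concrete, the sketch below illustrates one common way such reliability estimates are computed: resampling test-set sentences with replacement to build a confidence interval for a corpus-level score, plus a paired bootstrap test for comparing two systems on the same test set. This is a minimal illustration under our own assumptions, not the paper's exact procedure; the function names (bootstrap_ci, paired_bootstrap_test) are hypothetical, and the sketch assumes a metric that aggregates as a mean of per-sentence scores. Corpus-level BLEU would instead recompute its n-gram statistics over each resample.

```python
import random

def bootstrap_ci(sentence_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Estimate a (1 - alpha) confidence interval for a corpus-level score
    by resampling test-set sentences with replacement (bootstrap).
    Assumes the corpus score is the mean of per-sentence scores."""
    rng = random.Random(seed)
    n = len(sentence_scores)
    corpus_scores = []
    for _ in range(n_resamples):
        sample = [sentence_scores[rng.randrange(n)] for _ in range(n)]
        corpus_scores.append(sum(sample) / n)
    corpus_scores.sort()
    lo = corpus_scores[int((alpha / 2) * n_resamples)]
    hi = corpus_scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def paired_bootstrap_test(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of bootstrap resamples in which system A
    outscores system B. Both lists hold per-sentence scores for the
    same test sentences (paired resampling)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

As a usage example, if paired_bootstrap_test returns 0.99 over 1,000 resamples, system A outscores system B in 99% of resampled test sets, so the observed score difference is unlikely to be an artifact of the particular test set; values near 0.5 indicate the difference is not reliable at that test-set size.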