Measuring machine translation quality as semantic equivalence: A metric based on entailment features

  • Authors:
  • Sebastian Padó; Daniel Cer; Michel Galley; Dan Jurafsky; Christopher D. Manning

  • Affiliations:
  • Stuttgart University, Stuttgart, Germany (Padó); Stanford University, Stanford, USA (Cer, Galley, Jurafsky, Manning)

  • Venue:
  • Machine Translation
  • Year:
  • 2009

Abstract

Current evaluation metrics for machine translation have increasing difficulty in distinguishing good from merely fair translations. We believe the main problem to be their inability to properly capture meaning: a good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that assesses the quality of MT output through its semantic equivalence to the reference translation, based on a rich set of match and mismatch features motivated by textual entailment. We first evaluate this metric as a standalone evaluation measure against a combination of four state-of-the-art scores. Our metric predicts human judgments better than the combination metric, and combining the entailment and traditional features yields further improvements. We then demonstrate that the entailment metric can also be used as a learning criterion in minimum error rate training (MERT) to improve parameter estimation in MT system training. A manual evaluation of the resulting translations indicates that the new model obtains a significant improvement in translation quality.
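To make the core idea concrete, here is a minimal, hypothetical sketch of scoring a candidate translation against a reference via entailment-style match and mismatch features. The feature set (token overlap, negation mismatch, length ratio), helper names, and hand-set weights are illustrative stand-ins only; the paper's actual features come from a full textual-entailment system, with weights learned from human judgments.

```python
# Toy linear scorer over entailment-style match/mismatch features.
# All feature names and weights are hypothetical, not the paper's.

NEGATION_WORDS = {"not", "no", "never", "none"}

def extract_features(candidate: str, reference: str) -> dict:
    """Compute simple match/mismatch features between candidate and reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    cand_set, ref_set = set(cand), set(ref)
    # Match feature: fraction of reference tokens covered by the candidate.
    overlap = len(cand_set & ref_set) / max(len(ref_set), 1)
    # Mismatch feature: candidate and reference disagree on negation.
    neg_mismatch = float(bool(cand_set & NEGATION_WORDS)
                         != bool(ref_set & NEGATION_WORDS))
    # Length ratio as a crude proxy for insertions/deletions.
    length_ratio = min(len(cand), len(ref)) / max(len(cand), len(ref), 1)
    return {"overlap": overlap, "neg_mismatch": neg_mismatch,
            "length_ratio": length_ratio}

# Hand-set weights for illustration; in the paper's setting these would
# be learned by fitting the score to human quality judgments.
WEIGHTS = {"overlap": 2.0, "neg_mismatch": -3.0, "length_ratio": 1.0}

def entailment_score(candidate: str, reference: str) -> float:
    """Score a candidate translation against a reference translation."""
    feats = extract_features(candidate, reference)
    return sum(WEIGHTS[name] * value for name, value in feats.items())

if __name__ == "__main__":
    ref = "the committee did not approve the proposal"
    good = "the panel did not accept the proposal"
    bad = "the committee approved the proposal"   # meaning reversed
    print(entailment_score(good, ref))  # higher: meaning preserved
    print(entailment_score(bad, ref))   # lower: negation mismatch penalized
```

Under the same assumptions, a scorer of this shape could also serve as the objective in MERT tuning, replacing a surface metric such as BLEU, which is the second use the abstract describes.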