Existing evaluation metrics for machine translation lack crucial robustness: their correlations with human quality judgments vary considerably across languages and genres. We believe that the main reason is their inability to properly capture meaning: a good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that evaluates MT output based on a rich set of features motivated by textual entailment, such as lexical-semantic (in-)compatibility and argument structure overlap. We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings. The combination metric outperforms the individual scores but is itself outperformed by the entailment-based metric. Combining the entailment and traditional features yields further improvements.
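The abstract does not specify how the feature scores are combined into a single metric, so the following is only a minimal sketch of the combination-metric idea under an assumed setup: a linear model is fit on per-segment feature vectors (columns for BLEU, NIST, TER, and METEOR, optionally extended with entailment-motivated features) against human quality judgments, and the learned weights then score new candidates. The feature layout and the 1-5 adequacy scale are illustrative assumptions, not the authors' implementation.

    # Sketch of a combination MT metric: least-squares regression from
    # per-segment feature scores to human judgments. Illustrative only;
    # the feature set and learner are assumptions, not the paper's method.
    import numpy as np

    def fit_combination_metric(features, human_scores):
        """Fit least-squares weights mapping feature vectors to human judgments.

        features:     (n_segments, n_features) array, e.g. one column each for
                      BLEU, NIST, TER, METEOR, plus any entailment-motivated
                      features (lexical-semantic compatibility, argument
                      structure overlap).
        human_scores: (n_segments,) array of human quality judgments.
        """
        X = np.column_stack([features, np.ones(len(features))])  # append bias column
        weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
        return weights

    def score(features, weights):
        """Score new segments with the learned weights."""
        X = np.column_stack([features, np.ones(len(features))])
        return X @ weights

    # Toy usage with made-up numbers: four baseline metric scores per segment,
    # paired with hypothetical 1-5 human adequacy judgments.
    train_X = np.array([[0.31, 6.2, 0.45, 0.58],
                        [0.12, 4.1, 0.70, 0.33],
                        [0.44, 7.8, 0.30, 0.66]])
    train_y = np.array([4.0, 2.0, 5.0])
    w = fit_combination_metric(train_X, train_y)
    print(score(train_X, w))

In this framing, "combining the entailment and traditional features" simply means widening the feature matrix with both kinds of columns before fitting, which is one plausible reading of how the reported further improvements would be obtained.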