Corroborating text evaluation results with heterogeneous measures

Authors:
Enrique Amigó; Julio Gonzalo; Jesús Giménez; Felisa Verdejo
Affiliations:
UNED, Madrid; UNED, Madrid; UPC, Barcelona; UNED, Madrid
Venue:
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2011

Abstract

Automatically produced texts (e.g., translations or summaries) are usually evaluated with n-gram-based measures such as BLEU or ROUGE, while the wide range of more sophisticated measures proposed in recent years remains largely ignored in practice. In this paper, we first present an in-depth analysis of the state of the art to clarify this issue. We then formalize and empirically verify a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is itself measurable), the higher their combined reliability. These results support the use of heterogeneous measures to consolidate text evaluation results.
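The idea of corroborating an improvement across several measures can be illustrated with a small sketch. It is not taken from the paper: the function names, the toy scores, and in particular the disagreement-based estimate of heterogeneity are assumptions made here for illustration, and the paper's own formal definitions may differ.

```python
from itertools import combinations


def unanimously_better(scores_a, scores_b):
    """Return True if every measure scores output A strictly above output B."""
    return all(scores_a[m] > scores_b[m] for m in scores_a)


def heterogeneity(score_table):
    """Illustrative estimate (an assumption, not the paper's definition):
    the fraction of output pairs on which at least two measures disagree
    about which output is better.

    score_table maps an output id to a dict {measure name: score}.
    """
    pairs = list(combinations(score_table, 2))
    if not pairs:
        return 0.0
    conflicts = 0
    for a, b in pairs:
        diffs = [score_table[a][m] - score_table[b][m] for m in score_table[a]]
        # A conflict means some measure prefers A while another prefers B.
        if any(d > 0 for d in diffs) and any(d < 0 for d in diffs):
            conflicts += 1
    return conflicts / len(pairs)


# Hypothetical scores for three system outputs under three made-up measures.
scores = {
    "sys1": {"bleu_like": 0.30, "rouge_like": 0.40, "syntax_like": 0.50},
    "sys2": {"bleu_like": 0.25, "rouge_like": 0.35, "syntax_like": 0.45},
    "sys3": {"bleu_like": 0.28, "rouge_like": 0.30, "syntax_like": 0.60},
}

corroborated = unanimously_better(scores["sys1"], scores["sys2"])
contested = unanimously_better(scores["sys1"], scores["sys3"])
h = heterogeneity(scores)
```

In this toy setting, an improvement counts as corroborated only when all measures agree on its direction, so a measure set that often disagrees internally makes unanimous agreement a stricter test.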