A machine learning approach to the automatic evaluation of machine translation
ACL '01 Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
BLEU: a method for automatic evaluation of machine translation
ACL '02 Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics
Precision and recall of machine translation
NAACL-Short '03 Companion Volume of the Proceedings of HLT-NAACL 2003: Short Papers - Volume 2
ACL '04 Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics
Dependency treelet translation: syntactically informed phrasal SMT
ACL '05 Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics
QARLA: a framework for the evaluation of text summarization systems
ACL '05 Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics
ORANGE: a method for evaluating automatic evaluation metrics for machine translation
COLING '04 Proceedings of the 20th International Conference on Computational Linguistics
HLT '05 Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing
BLANC: learning evaluation metrics for MT
HLT '05 Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing
MT evaluation: human-like vs. human acceptable
COLING-ACL '06 Proceedings of the COLING/ACL Main Conference Poster Sessions
Stochastic iterative alignment for machine translation evaluation
COLING-ACL '06 Proceedings of the COLING/ACL Main Conference Poster Sessions
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
HLT '02 Proceedings of the Second International Conference on Human Language Technology Research
Evaluating machine translation with LFG dependencies
Machine Translation
Dependency-based automatic evaluation for machine translation
SSST '07 Proceedings of the NAACL-HLT 2007/AMTA Workshop on Syntax and Structure in Statistical Translation
Word error rates: decomposition over POS classes and applications for error analysis
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Labelled dependencies in machine translation evaluation
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Linguistic features for automatic evaluation of heterogenous MT systems
StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation
Further meta-evaluation of machine translation
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Findings of the 2009 workshop on statistical machine translation
StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation
A simple automatic MT evaluation metric
StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation
DEPEVAL(summ): dependency-based evaluation for automatic summaries
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
Robust machine translation evaluation with entailment features
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
The contribution of linguistic features to automatic machine translation evaluation
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
WMT '10 Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
UNED: evaluating text similarity measures without human assessments
SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
A graphical interface for MT evaluation and error analysis
ACL '12 Proceedings of the ACL 2012 System Demonstrations
The heterogeneity principle in evaluation measures for automatic summarization
Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization
Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide range of more sophisticated measures proposed in recent years remains largely ignored in practice. In this paper we first present an in-depth analysis of the state of the art in order to clarify this issue. We then formalize, and verify empirically, a set of properties satisfied by every text evaluation measure based on similarity to human-produced references. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. Moreover, the greater the heterogeneity of the measures (which is itself measurable), the higher their combined reliability. These results support the use of heterogeneous measures to consolidate text evaluation results.
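As a rough illustration of the abstract's central claim, the following Python sketch checks whether a system improvement is corroborated by several reference-based measures and estimates the heterogeneity of the measure set as the rate at which two measures order a pair of outputs differently. The metric names, the example scores, and this particular heterogeneity estimate are illustrative assumptions, not the formalization given in the paper.

```python
# Toy sketch of the heterogeneity principle (illustrative, not the authors' formalization).
from itertools import combinations

def heterogeneity(scores_by_metric):
    """Average rate at which two measures rank a pair of outputs differently
    (higher value = more heterogeneous measure set)."""
    metrics = list(scores_by_metric)
    n_items = len(next(iter(scores_by_metric.values())))
    disagreements, comparisons = 0, 0
    for m1, m2 in combinations(metrics, 2):
        s1, s2 = scores_by_metric[m1], scores_by_metric[m2]
        for i, j in combinations(range(n_items), 2):
            d1, d2 = s1[i] - s1[j], s2[i] - s2[j]
            if d1 * d2 != 0:  # both measures express a preference for this pair
                comparisons += 1
                if (d1 > 0) != (d2 > 0):
                    disagreements += 1
    return disagreements / comparisons if comparisons else 0.0

def corroborated_improvement(system_a, system_b):
    """True only if every measure scores system B at least as high as system A."""
    return all(system_b[m] >= system_a[m] for m in system_a)

# Hypothetical segment-level scores from three automatic measures.
scores = {
    "bleu":        [0.21, 0.35, 0.18, 0.40],
    "rouge_l":     [0.30, 0.33, 0.25, 0.45],
    "dep_overlap": [0.55, 0.20, 0.60, 0.35],
}
print("heterogeneity:", round(heterogeneity(scores), 2))

baseline = {"bleu": 0.24, "rouge_l": 0.31, "dep_overlap": 0.42}
improved = {"bleu": 0.27, "rouge_l": 0.33, "dep_overlap": 0.44}
print("improvement corroborated by all measures:",
      corroborated_improvement(baseline, improved))
```

Under the heterogeneity principle, a corroborated improvement is more trustworthy when the heterogeneity of the measure set is high, since measures that rarely agree in general are unlikely to agree on the improvement by chance.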