Summary evaluation: together we stand NPowER-ed
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2
The development of summarization systems requires reliable similarity (evaluation) measures that compare system outputs with human references. A reliable measure should correspond well with human judgements. However, a measure's reliability depends on the test collection on which it is meta-evaluated; for this reason, it has not yet been possible to reliably establish which evaluation measures are best for automatic summarization. In this paper, we propose an unsupervised method called Heterogeneity-Based Ranking (HBR) that combines summarization evaluation measures without requiring human assessments. Our empirical results indicate that HBR achieves a correspondence with human assessments similar to that of the best single measure for every observed corpus. In addition, HBR results are more robust across topics than those of single measures.
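To make the idea of combining evaluation measures without human assessments concrete, here is a minimal illustrative sketch: each candidate summary is ranked under every measure, and the per-measure ranks are averaged into a single combined ranking. This is a deliberately simplified stand-in (plain rank averaging), not the HBR algorithm described in the paper; all function names and scores are hypothetical.

```python
# Toy illustration of unsupervised measure combination: average each
# candidate's rank across several automatic measures. This is plain
# rank averaging, NOT the paper's Heterogeneity-Based Ranking (HBR).

def rank(scores):
    # Rank positions (0 = best); ties broken by original order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = pos
    return ranks

def combined_ranking(measure_scores):
    # measure_scores: one score list per measure, each aligned over
    # the same candidate summaries.
    n = len(measure_scores[0])
    avg = [sum(rank(m)[i] for m in measure_scores) / len(measure_scores)
           for i in range(n)]
    # Candidate indices ordered best-first by average rank.
    return sorted(range(n), key=lambda i: avg[i])

# Three hypothetical measures scoring four candidate summaries.
rouge = [0.40, 0.55, 0.30, 0.50]
bleu  = [0.35, 0.60, 0.20, 0.45]
dep   = [0.50, 0.58, 0.25, 0.40]
print(combined_ranking([rouge, bleu, dep]))  # → [1, 3, 0, 2]
```

No human judgements enter the combination step: the ranking is built purely from agreement among the automatic measures themselves, which is the setting the abstract describes.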