Mind the gap: dangers of divorcing evaluations of summary content from linguistic quality

Authors:
John M. Conroy;Hoa Trang Dang
Affiliations:
IDA/Center for Computing Sciences, Bowie, Maryland;National Institute of Standards and Technology, Gaithersburg, Maryland
Venue:
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Year:
2008

Abstract

In this paper, we analyze the state of current human and automatic evaluation of topic-focused summarization in the Document Understanding Conference main task for 2005-2007. The analyses show that while ROUGE has very strong correlation with responsiveness for both human and automatic summaries, there is a significant gap in responsiveness between humans and systems that the ROUGE metrics do not account for. In addition to teasing out gaps in the current automatic evaluation, we propose a method to maximize the strength of current automatic evaluations by using canonical correlation. We apply this new evaluation method, which we call ROSE (ROUGE Optimal Summarization Evaluation), to find the optimal linear combination of ROUGE scores that maximizes correlation with human responsiveness.
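
The idea of an optimal linear combination of ROUGE scores can be illustrated with a minimal sketch. It is not the authors' implementation: the data below are synthetic, and the names `optimal_linear_combination`, `pearson`, `rouge`, and `resp` are hypothetical. The sketch relies on a standard fact: when there is a single target variable (here, responsiveness), canonical correlation reduces to ordinary least squares on centered data, so the least-squares weights give the linear combination with maximal Pearson correlation.

```python
import numpy as np


def optimal_linear_combination(rouge_scores, responsiveness):
    """Return weights w that maximize corr(rouge_scores @ w, responsiveness).

    Assumption: one target variable, so canonical correlation
    reduces to least squares on mean-centered data.
    """
    X = np.asarray(rouge_scores, dtype=float)
    y = np.asarray(responsiveness, dtype=float)
    Xc = X - X.mean(axis=0)  # center each ROUGE variant
    yc = y - y.mean()        # center the human scores
    w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return w


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-in data (NOT from the paper): 30 hypothetical systems,
# 3 hypothetical ROUGE variants, and a noisy responsiveness score.
rng = np.random.default_rng(0)
n_systems = 30
rouge = rng.uniform(0.05, 0.45, size=(n_systems, 3))
resp = 2.0 * rouge[:, 0] + 5.0 * rouge[:, 1] + rng.normal(0.0, 0.1, n_systems)

w = optimal_linear_combination(rouge, resp)
combined = rouge @ w

# Correlation of the best single ROUGE variant, for comparison.
best_single = max(abs(pearson(rouge[:, j], resp)) for j in range(rouge.shape[1]))

print("weights:", w)
print("combined correlation:", pearson(combined, resp))
print("best single-metric correlation:", best_single)
```

On the data used to fit the weights, the combined score correlates with responsiveness at least as strongly as any single ROUGE variant. Whether that advantage holds on unseen systems would need held-out data to check.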