Correlation between ROUGE and human evaluation of extractive meeting summaries

Authors:
Feifan Liu;Yang Liu
Affiliations:
The University of Texas at Dallas, Richardson, TX;The University of Texas at Dallas, Richardson, TX
Venue:
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Year:
2008

Abstract

Automatic summarization evaluation is critical to the development of summarization systems. While ROUGE has been shown to correlate well with human evaluation for content match in text summarization, the multiparty meeting domain has many characteristics that may pose problems for ROUGE. In this paper, we carefully examine how well ROUGE scores correlate with human evaluation for extractive meeting summarization. Our experiments show that the correlation is generally rather low, but that a significantly better correlation can be obtained by accounting for several characteristics unique to meetings, such as disfluencies and speaker information, especially when evaluating system-generated summaries.