The use of MMR, diversity-based reranking for reordering documents and producing summaries
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating Natural Language Processing Systems: An Analysis and Review
Centroid-based summarization of multiple documents
Information Processing and Management: An International Journal
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Task-based evaluation of text summarization using Relevance Prediction
Information Processing and Management: An International Journal
Correlation between ROUGE and human evaluation of extractive meeting summaries
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Short Papers
A skip-chain conditional random field for ranking meeting utterances by importance
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
AskHERMES: An online question answering system for complex clinical questions
Journal of Biomedical Informatics
Why is "SXSW" trending?: exploring multiple text sources for Twitter topic summarization
Proceedings of the Workshop on Languages in Social Media
An assessment of the accuracy of automatic evaluation in summarization
Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization
Extractive speech summarization using evaluation metric-related training criteria
Information Processing and Management: An International Journal
Automatic summarization evaluation is important to the development of summarization systems. In text summarization, ROUGE has been shown to correlate well with human evaluation when measuring the match of content units. However, the multiparty meeting domain has many characteristics that may pose problems for ROUGE. The goal of this paper is to examine how well ROUGE scores correlate with human evaluation for extractive meeting summarization, and to explore different meeting-domain-specific factors that affect this correlation. This study extends the analysis of our previous work [1]. Our experiments show that the overall correlation between ROUGE and human evaluation is weak; however, when several unique meeting characteristics, such as disfluencies, speaker information, and stopwords, are accounted for in the ROUGE setting, better correlation can be achieved, especially on system summaries. We also find that these factors affect human and system summaries differently. In addition, we contrast the ROUGE results with other automatic summarization evaluation metrics, such as Kappa and the Pyramid method, and show that ROUGE is appropriate for this study.
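The abstract does not include the paper's evaluation code; the following is a minimal, self-contained Python sketch of the general kind of analysis it describes: score each system summary against a reference with a simplified ROUGE-1 recall (optionally dropping stopwords, one of the factors mentioned above), then rank-correlate the system-level scores with human judgments. The stopword list, tokenizer, example summaries, and human ratings are all illustrative assumptions, not data or code from the paper.

    # Minimal sketch (illustrative only, not the paper's evaluation code):
    # simplified ROUGE-1 recall per system, then Spearman correlation
    # between system-level ROUGE scores and human judgments.
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}  # toy list

    def tokenize(text, drop_stopwords=False):
        tokens = text.lower().split()
        if drop_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        return tokens

    def rouge1_recall(candidate, reference, drop_stopwords=False):
        """Unigram overlap recall: clipped candidate counts over reference length."""
        cand = Counter(tokenize(candidate, drop_stopwords))
        ref = Counter(tokenize(reference, drop_stopwords))
        if not ref:
            return 0.0
        overlap = sum(min(cand[w], c) for w, c in ref.items())
        return overlap / sum(ref.values())

    def spearman(xs, ys):
        """Spearman rank correlation (no tie correction, for brevity)."""
        def ranks(vals):
            order = sorted(range(len(vals)), key=lambda i: vals[i])
            r = [0.0] * len(vals)
            for rank, i in enumerate(order, start=1):
                r[i] = rank
            return r
        rx, ry = ranks(xs), ranks(ys)
        n = len(xs)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1 - 6 * d2 / (n * (n ** 2 - 1))

    # Hypothetical data: per-system summaries, one shared reference, human scores.
    reference = "the team agreed to revise the budget and schedule a follow up meeting"
    systems = {
        "sys_a": "team agreed to revise budget and schedule follow up meeting",
        "sys_b": "uh the the budget was um discussed in the meeting",
        "sys_c": "a follow up meeting was scheduled to revise the budget",
    }
    human_scores = {"sys_a": 4.5, "sys_b": 2.0, "sys_c": 3.5}  # fabricated ratings

    names = list(systems)
    rouge = [rouge1_recall(systems[n], reference, drop_stopwords=True) for n in names]
    human = [human_scores[n] for n in names]
    print("Spearman(ROUGE, human) =", round(spearman(rouge, human), 3))

Toggling drop_stopwords (or filtering filled pauses such as "uh"/"um" before scoring, analogous to the paper's disfluency factor) changes the system-level ROUGE scores and hence the measured correlation, which is the effect the abstract reports.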