Task-based evaluation of text summarization using Relevance Prediction

  • Authors:
  • Stacy President Hobson; Bonnie J. Dorr; Christof Monz; Richard Schwartz

  • Affiliations:
  • Stacy President Hobson, Bonnie J. Dorr: Department of Computer Science and UMIACS, University of Maryland, College Park, MD 20742, United States
  • Christof Monz: Department of Computer Science, Queen Mary, University of London, London E1 4NS, UK
  • Richard Schwartz: BBN Technologies, Columbia, MD 21046, United States

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2007

Abstract

This article introduces a new task-based evaluation measure called Relevance Prediction, which is a more intuitive measure of an individual's performance on a real-world task than interannotator agreement. Relevance Prediction parallels what a user does in the real-world task of browsing a set of documents with standard search tools: the user judges relevance based on a short summary, and then that same user (not an independent user) decides whether to open, and judge, the corresponding document. This measure is shown to be a more reliable measure of task performance than LDC Agreement, a gold-standard-based measure currently used in the summarization evaluation community. Our goal is to provide a stable framework within which developers of new automatic measures can make stronger statistical statements about the effectiveness of their measures in predicting summary usefulness. As a proof-of-concept methodology for automatic metric developers, we demonstrate that a current automatic evaluation measure correlates better with Relevance Prediction than with LDC Agreement, and that the significance level for the detected differences is higher for the former than for the latter.
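
The abstract describes Relevance Prediction informally: the same user first judges relevance from a summary and then from the corresponding full document, and the measure reflects how often those two judgments agree, whereas LDC Agreement compares summary-based judgments against an independent gold standard. The sketch below illustrates that contrast under assumed data structures; the function names, record layout, and binary relevance labels are hypothetical and are not taken from the paper.

```python
# Illustrative sketch of the Relevance Prediction idea described in the abstract.
# Data layout and names are assumptions for exposition, not the authors' code.

from typing import Dict, Tuple

# Each judgment record maps (user_id, doc_id) -> "relevant" / "not relevant"
SummaryJudgments = Dict[Tuple[str, str], str]
DocJudgments = Dict[Tuple[str, str], str]


def relevance_prediction(summary_j: SummaryJudgments, doc_j: DocJudgments) -> float:
    """Fraction of (user, document) pairs where the SAME user's summary-based
    judgment matches that user's judgment of the full document."""
    shared = [key for key in summary_j if key in doc_j]
    if not shared:
        return 0.0
    matches = sum(1 for key in shared if summary_j[key] == doc_j[key])
    return matches / len(shared)


def ldc_agreement(summary_j: SummaryJudgments, gold: Dict[str, str]) -> float:
    """Fraction of summary-based judgments that match an independent
    gold-standard judgment (e.g., an LDC annotator) of the document."""
    shared = [(user, doc) for (user, doc) in summary_j if doc in gold]
    if not shared:
        return 0.0
    matches = sum(1 for (user, doc) in shared if summary_j[(user, doc)] == gold[doc])
    return matches / len(shared)


if __name__ == "__main__":
    summary_j = {("u1", "d1"): "relevant", ("u1", "d2"): "not relevant"}
    doc_j     = {("u1", "d1"): "relevant", ("u1", "d2"): "relevant"}
    gold      = {"d1": "relevant", "d2": "not relevant"}
    print("Relevance Prediction:", relevance_prediction(summary_j, doc_j))  # 0.5
    print("LDC Agreement:", ldc_agreement(summary_j, gold))                 # 1.0
```

The toy example shows how the two measures can diverge for the same summaries: a user may agree perfectly with the gold standard while still being misled about their own eventual document-level judgment, which is the behavior Relevance Prediction is intended to capture.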