The most widely adopted approaches for evaluating summary content follow some protocol for comparing a summary with gold-standard human summaries, traditionally called model summaries. This evaluation paradigm falls short when human summaries are unavailable and becomes less accurate when only a single model is available. We propose three novel evaluation techniques. Two of them are model-free and do not rely on a gold standard for the assessment. The third improves standard automatic evaluations by expanding the set of available model summaries with chosen system summaries. We show that quantifying the similarity between the source text and its summary with appropriately chosen measures produces summary scores that replicate human assessments accurately. We also explore ways of increasing evaluation quality when only one human model summary is available as a gold standard. We introduce pseudomodels, which are system summaries deemed to contain good content according to automatic evaluation. Combining the pseudomodels with the single human model to form the gold standard leads to higher correlations with human judgments than using only the one available model. Finally, we explore the feasibility of another measure: similarity between a system summary and the pool of all other system summaries for the same input. This method of comparison with the consensus of systems produces impressively accurate rankings of system summaries, achieving correlation with human rankings above 0.9.
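The two model-free ideas above can be sketched concretely. The snippet below scores a summary by the similarity of its word distribution to the source text's, and scores a system summary against the pool of other system summaries. The choice of Jensen-Shannon divergence as the similarity measure is an assumption for illustration; the abstract only says "appropriately chosen measures", and all function names here are hypothetical.

```python
# Sketch of model-free summary content evaluation.
# Assumption: Jensen-Shannon divergence over unigram distributions
# stands in for the paper's "appropriately chosen measures".
import math
from collections import Counter


def word_dist(text, vocab):
    """Unigram probability distribution of `text` over `vocab`."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (base 2)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def content_score(source, summary):
    """Model-free score: how close is the summary's word
    distribution to the source's? Higher is better."""
    vocab = sorted(set(source.lower().split()) | set(summary.lower().split()))
    p = word_dist(source, vocab)
    q = word_dist(summary, vocab)
    # Lower divergence means the summary is distributionally closer
    # to the source, so negate to get a higher-is-better score.
    return -js_divergence(p, q)


def consensus_score(summary, peer_summaries):
    """Score a system summary against the pool of all other system
    summaries for the same input (the consensus comparison)."""
    return sum(content_score(peer, summary)
               for peer in peer_summaries) / len(peer_summaries)
```

In this sketch, ranking systems by averaging `content_score` (or `consensus_score`) over all inputs would correspond to the system-level rankings whose correlation with human judgments is reported above.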