Quantifying the limits and success of extractive summarization systems across domains

  • Authors:
  • Hakan Ceylan;Rada Mihalcea;Umut Özertem;Elena Lloret;Manuel Palomar

  • Affiliations:
  • University of North Texas, Denton, TX;University of North Texas, Denton, TX;Yahoo! Labs, Sunnyvale, CA;University of Alicante, Alicante, Spain;University of Alicante, Alicante, Spain

  • Venue:
  • HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
  • Year:
  • 2010

Abstract

This paper analyzes the topic identification stage of single-document automatic text summarization across four domains: newswire, literary, scientific, and legal documents. We present a study that explores the summary space of each domain via an exhaustive search strategy and finds the probability density function (pdf) of the ROUGE score distribution for each domain. We then use this pdf to calculate the percentile rank of extractive summarization systems. Our results introduce a new way to judge the success of automatic summarization systems and offer quantified explanations for questions such as why systems to date have found it so hard to achieve a statistically significant improvement over the lead baseline in the news domain.
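
The idea of ranking a system within the summary space can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy four-sentence document, uses clipped unigram recall as a simplified stand-in for ROUGE-1 rather than the official ROUGE toolkit, replaces the fitted pdf with the raw empirical distribution of scores, and enumerates extracts by brute force, which is only feasible for very short documents. All function names and the example sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rouge1_recall(candidate_tokens, reference_tokens):
    """Clipped unigram recall: a simplified stand-in for ROUGE-1."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # Each reference token is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / max(sum(ref.values()), 1)


def summary_space_scores(sentences, reference, k):
    """Score every possible k-sentence extract (exhaustive enumeration)."""
    ref_tokens = reference.lower().split()
    scores = {}
    for idx in combinations(range(len(sentences)), k):
        tokens = " ".join(sentences[i] for i in idx).lower().split()
        scores[idx] = rouge1_recall(tokens, ref_tokens)
    return scores


def percentile_rank(all_scores, score):
    """Percentage of extracts scoring below `score`, counting ties as half."""
    values = list(all_scores)
    below = sum(v < score for v in values)
    equal = sum(v == score for v in values)
    return 100.0 * (below + 0.5 * equal) / len(values)


# Toy document and reference summary (invented for illustration only).
sentences = [
    "the court ruled on the appeal",
    "the weather was mild that day",
    "the appeal was rejected by the judges",
    "reporters waited outside",
]
reference = "the court rejected the appeal"

scores = summary_space_scores(sentences, reference, k=2)
lead = tuple(range(2))  # lead baseline: the first k sentences
lead_rank = percentile_rank(scores.values(), scores[lead])
print(f"lead baseline score={scores[lead]:.2f}, percentile rank={lead_rank:.1f}")
```

A percentile rank computed this way places a system's score relative to every extract of the same length, which is the kind of comparison the abstract describes. For documents of realistic length the number of extracts grows combinatorially, so a practical study would need a more efficient search than this brute-force loop.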