Summarizing large document sets using concept-based clustering

  • Authors:
  • Hilda Hardy;Nobuyuki Shimizu;Tomek Strzalkowski;Liu Ting;G. Bowden Wise;Xinyang Zhang

  • Affiliations:
  • University at Albany, Albany, NY;University at Albany, Albany, NY;University at Albany, Albany, NY;University at Albany, Albany, NY;GE Global Research Center, Niskayuna, NY;University at Albany, Albany, NY

  • Venue:
  • HLT '02 Proceedings of the second international conference on Human Language Technology Research
  • Year:
  • 2002

Abstract

This paper describes our multi-document summarizer, XDoX, which is designed to summarize large sets of documents (50 to 500). These documents are typically obtained from routing or filtering systems run against a continuous stream of data, such as a newswire. XDoX identifies the most salient or most often repeated themes within the set and composes an extraction summary reflecting these main themes. The summarizer uses a unique n-gram scoring method to give greater importance to clusters of passages that have significant common phrases. Our methods are robust, topic-independent, and easily extensible to multilingual applications. We show examples of summaries obtained in our tests as well as from our participation in the first Document Understanding Conference (DUC).
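
The abstract does not specify how the n-gram scoring works, so the following is only a minimal illustrative sketch of the general idea of rewarding clusters whose passages share phrases. It is not the XDoX method. The function names (`ngrams`, `cluster_score`), the choice of bigrams and trigrams, and the weighting (longer shared n-grams count more) are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def cluster_score(passages, sizes=(2, 3)):
    """Hypothetical score: reward n-grams that occur in more than one passage.

    Assumption (not from the paper): an n-gram found in c passages
    contributes n * (c - 1), so longer and more widely shared phrases
    weigh more, and phrases unique to one passage contribute nothing.
    """
    score = 0
    for n in sizes:
        counts = Counter()
        for passage in passages:
            tokens = re.findall(r"[a-z0-9]+", passage.lower())
            # Each passage counts a given n-gram at most once.
            counts.update(ngrams(tokens, n))
        score += sum(n * (c - 1) for c in counts.values() if c > 1)
    return score


related = [
    "The central bank raised interest rates",
    "Analysts said the central bank raised rates again",
]
unrelated = [
    "The weather was sunny",
    "Stocks fell sharply today",
]
print(cluster_score(related), cluster_score(unrelated))
```

Under these assumptions, a cluster of passages that repeat the same phrases scores higher than a cluster of unrelated passages, which is the kind of preference the abstract describes.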