Summarizing large document sets using concept-based clustering

Authors:
Hilda Hardy;Nobuyuki Shimizu;Tomek Strzalkowski;Liu Ting;G. Bowden Wise;Xinyang Zhang
Affiliations:
University at Albany, Albany, NY;University at Albany, Albany, NY;University at Albany, Albany, NY;University at Albany, Albany, NY;GE Global Research Center, Niskayuna, NY;University at Albany, Albany, NY
Venue:
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Year:
2002

Abstract

This paper describes XDoX, our multi-document summarizer designed to summarize large sets of documents (50 to 500). These documents are typically obtained from routing or filtering systems run against a continuous stream of data, such as a newswire. XDoX identifies the most salient or most frequently repeated themes within the set and composes an extractive summary reflecting these main themes. The summarizer uses a unique n-gram scoring method to give greater importance to clusters of passages that share significant common phrases. Our methods are robust, topic-independent, and easily extensible to multilingual applications. We show examples of summaries obtained in our tests as well as from our participation in the first Document Understanding Conference (DUC).
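
The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of ranking passage clusters by the phrases their passages share. It is not the XDoX method. The function names (`ngrams`, `cluster_score`, `rank_clusters`), the whitespace tokenization, the `max_n` parameter, and the weighting (an n-gram found in at least two passages contributes its length times the number of passages containing it) are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cluster_score(passages, max_n=3):
    """Score a cluster of passages by the n-grams they have in common.

    Assumed weighting (hypothetical, not taken from the paper): an n-gram
    counts only if it occurs in at least two passages, and it contributes
    n * (number of passages that contain it), so longer shared phrases
    weigh more than shared single words.
    """
    score = 0
    for n in range(1, max_n + 1):
        passage_freq = Counter()
        for text in passages:
            tokens = text.lower().split()
            # Use a set so each passage counts an n-gram at most once.
            passage_freq.update(set(ngrams(tokens, n)))
        score += sum(n * pf for pf in passage_freq.values() if pf >= 2)
    return score


def rank_clusters(clusters, max_n=3):
    """Order clusters from highest to lowest shared-phrase score."""
    return sorted(clusters, key=lambda c: cluster_score(c, max_n), reverse=True)


if __name__ == "__main__":
    related = ["the rebels seized the capital", "rebels seized the airport"]
    unrelated = ["stocks fell sharply", "rain is expected tomorrow"]
    for cluster in rank_clusters([unrelated, related]):
        print(cluster_score(cluster), cluster)
```

Under these assumptions, a cluster whose passages repeat a phrase such as "rebels seized the" outranks a cluster of passages with no words in common, which is the behavior the abstract attributes to the n-gram scoring step.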