Information fusion for multidocument summarization: paraphrasing and generation

  • Authors:
  • Kathleen R. Mckeown;Regina Barzilay

  • Affiliations:
  • -;-

  • Venue:
  • Information fusion for multidocument summarization: paraphrasing and generation
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

The number and variety of online news sources makes it difficult for people to track the news concerning even a single event. Redundancy causes such tracking to be extremely time-consuming: multiple news feeds on the same event tend to contain similar information. A summary of such news feeds can present important information in one short text, dramatically reducing reading time. The focus of this thesis is information fusion, a technique which, given multiple documents, identifies redundant information and synthesizes a coherent summary. This technique is embodied in MultiGen, a system that I have designed, implemented and evaluated over the course of my Ph.D. Unlike previous work in the area, MultiGen is a domain-independent system: it generates news summaries on a variety of topics in any domain. Another contribution to the state of the art is that the system generates the summary by reusing and altering phrases from the input articles, creating a more fluent and cohesive text. This is in contrast with other existing systems, which simply extract sentences from input articles and concatenate them together, leading to fluency problems. Currently MultiGen operates as part of Columbia's Newsblaster system. Everyday, Newsblaster downloads all news articles from a variety of sources, clusters articles by topic, and generates a cohesive, readable automatic summary of each document cluster. One key challenge in multidocument summarization is eliminating redundant information in the produced summaries. Articles about the same event often contain descriptions of the same fact using different wording. To address this issue, we need a method to identify paraphrases—fragments of text that convey similar meaning even if they are not identical in wording. Automatic identification of paraphrases was not addressed in previous research, although it is necessary for many applications, including question-answering, information extraction and natural language generation. This thesis presents unsupervised learning techniques to identify paraphrases given a corpus of multiple parallel texts. This type of corpus provides many instances of paraphrasing, because these texts preserve the meaning of the original source, but may use different words to convey the meaning. Both the data and the method are departures from past approaches to corpus based techniques. Our evaluation experiments show that the algorithm extracts paraphrases with high accuracy and significantly outperforms a state of the art algorithm developed for related tasks in machine translation.