Information fusion for multidocument summarization: paraphrasing and generation

Authors:
Kathleen R. McKeown;Regina Barzilay
Affiliations:
-;-
Venue:
Information fusion for multidocument summarization: paraphrasing and generation
Year:
2003

Citing 0
Cited 30

Query based event extraction along a timeline

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Probabilistic text structuring: experiments with sentence ordering

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Sentence Fusion for Multidocument News Summarization

Computational Linguistics
Towards developing generation algorithms for text-to-text applications

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Syntactic simplification for improving content selection in multi-document summarization

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Improving multilingual summarization: using redundancy in the input to correct MT errors

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Automatically learning cognitive status for multi-document summarization of newswire

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Automatic Evaluation of Information Ordering: Kendall's Tau

Computational Linguistics
Dependency-Based Construction of Semantic Space Models

Computational Linguistics
Abstractive headline generation using WIDL-expressions

Information Processing and Management: an International Journal
Modeling local coherence: An entity-based approach

Computational Linguistics
Web warehouse - a new web information fusion tool for web mining

Information Fusion
ParaMT: A Paraphraser for Machine Translation

PROPOR '08 Proceedings of the 8th international conference on Computational Processing of the Portuguese Language
Constructing corpora for the development and evaluation of paraphrase systems

Computational Linguistics
Improving meeting summarization by focusing on user needs: a task-oriented evaluation

Proceedings of the 14th international conference on Intelligent user interfaces
Generating research websites using summarisation techniques

HLT-Demonstrations '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session
ParaMetric: an automatic evaluation metric for paraphrasing

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Sentence compression beyond word deletion

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Domain-independent shallow sentence ordering

SRWS '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium
Extracting paraphrase patterns from bilingual parallel corpora

Natural Language Engineering
Classification of semantic relations by humans and machines

EMSEE '05 Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment
Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora

BUCC '09 Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora
Automatic alignment of common information in comparable sentences of Portuguese

Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web
Information status distinctions and referring expressions: An empirical study of references to people in news summaries

Computational Linguistics
Learning to simplify sentences with quasi-synchronous grammar and integer programming

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Monolingual distributional similarity for text-to-text generation

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
An abstractive approach to sentence compression

ACM Transactions on Intelligent Systems and Technology (TIST) - Special Sections on Paraphrasing; Intelligent Systems for Socially Aware Computing; Social Computing, Behavioral-Cultural Modeling, and Prediction
Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection

Computational Linguistics

Abstract

The number and variety of online news sources make it difficult for people to track the news concerning even a single event. Redundancy makes such tracking extremely time-consuming: multiple news feeds on the same event tend to contain similar information. A summary of such news feeds can present the important information in one short text, dramatically reducing reading time. The focus of this thesis is information fusion, a technique that, given multiple documents, identifies redundant information and synthesizes a coherent summary. This technique is embodied in MultiGen, a system that I have designed, implemented and evaluated over the course of my Ph.D. Unlike previous work in the area, MultiGen is a domain-independent system: it generates news summaries on a variety of topics in any domain. Another contribution to the state of the art is that the system generates the summary by reusing and altering phrases from the input articles, creating a more fluent and cohesive text. This is in contrast with other existing systems, which simply extract sentences from the input articles and concatenate them, leading to fluency problems. Currently MultiGen operates as part of Columbia's Newsblaster system. Every day, Newsblaster downloads all news articles from a variety of sources, clusters the articles by topic, and generates a cohesive, readable automatic summary of each document cluster.
One key challenge in multidocument summarization is eliminating redundant information in the produced summaries. Articles about the same event often contain descriptions of the same fact using different wording. To address this issue, we need a method to identify paraphrases: fragments of text that convey similar meaning even if they are not identical in wording. Automatic identification of paraphrases was not addressed in previous research, although it is necessary for many applications, including question answering, information extraction and natural language generation.
This thesis presents unsupervised learning techniques to identify paraphrases given a corpus of multiple parallel texts. This type of corpus provides many instances of paraphrasing, because the texts preserve the meaning of the original source but may use different words to convey it. Both the data and the method are departures from past approaches to corpus-based techniques. Our evaluation experiments show that the algorithm extracts paraphrases with high accuracy and significantly outperforms a state-of-the-art algorithm developed for related tasks in machine translation.
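
To make the idea of mining paraphrases from parallel texts concrete, here is a minimal sketch. It is not the algorithm developed in the thesis: it only assumes that two sentences are already aligned as renderings of the same source, and it treats the spans where they differ as candidate paraphrases. The function name `candidate_paraphrases` and the sentence pair are made up for illustration.

```python
from difflib import SequenceMatcher


def candidate_paraphrases(sent_a, sent_b):
    """Return (phrase_a, phrase_b) pairs where two aligned sentences differ.

    Simplifying assumption: the sentences express the same content, so a
    span that is replaced while the surrounding words stay identical is a
    plausible paraphrase candidate.
    """
    a = sent_a.lower().split()
    b = sent_b.lower().split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a differing span on both sides of the alignment.
        if tag == "replace":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


# Hypothetical pair of parallel sentences (not taken from the thesis corpus).
t1 = "the minister quit his post on monday after the vote"
t2 = "the minister resigned on monday following the vote"
for left, right in candidate_paraphrases(t1, t2):
    print(f"{left!r} <-> {right!r}")
```

A purely surface-level alignment like this would produce many false candidates on real data; it is meant only to show why parallel texts are a rich source of paraphrase instances.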