A framework for identifying textual redundancy

Authors:
Kapil Thadani;Kathleen McKeown
Affiliations:
Columbia University, New York, NY;Columbia University, New York, NY
Venue:
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Year:
2008

Citing 10

Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems
Approximation algorithms for NP-hard problems

The use of MMR, diversity-based reranking for reordering documents and producing summaries
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval

Normalized Cuts and Image Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence

From single to multi-document summarization: a prototype system and its evaluation
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics

Learning to paraphrase: an unsupervised approach using multiple-sequence alignment
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Syntax-based alignment of multiple translations: extracting paraphrases and generating new sentences
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Sentence Fusion for Multidocument News Summarization
Computational Linguistics

A formal model for information selection in multi-sentence text extraction
COLING '04 Proceedings of the 20th international conference on Computational Linguistics

Syntactic simplification for improving content selection in multi-document summarization
COLING '04 Proceedings of the 20th international conference on Computational Linguistics

The Pyramid Method: Incorporating human content selection variation in summarization evaluation
ACM Transactions on Speech and Language Processing (TSLP)

Cited 1

Towards strict sentence intersection: decoding and evaluation strategies
MTTG '11 Proceedings of the Workshop on Monolingual Text-To-Text Generation

Abstract

The task of identifying redundant information in documents that are generated from multiple sources poses a significant challenge for summarization and QA systems. Traditional clustering techniques detect redundancy at the sentential level and do not guarantee the preservation of all information within the document. We discuss an algorithm that generates a novel graph-based representation for a document and then uses a set cover approximation algorithm to remove redundant text from it. Our experiments show that this approach offers a significant performance advantage over clustering when evaluated on an annotated dataset.
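
To make the set cover step in the abstract concrete, here is a minimal sketch of the standard greedy set cover approximation applied to redundancy removal. It is not the authors' implementation: the abstract does not describe the graph-based representation, so this sketch assumes that each sentence has already been reduced to a set of information labels. The names `greedy_set_cover`, `sentences`, and the labels "a" to "d" are hypothetical and chosen only for illustration.

```python
def greedy_set_cover(universe, candidates):
    """Greedy set cover approximation.

    Repeatedly picks the candidate that covers the most elements that are
    still uncovered, until every element of `universe` is covered.
    `candidates` maps a name to the set of elements it covers.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Candidate with the largest number of newly covered elements;
        # ties go to the first candidate in insertion order.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("candidates do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical toy input: each sentence is mapped to the pieces of
# information it conveys (an assumption, not the paper's representation).
sentences = {
    "s1": {"a", "b"},
    "s2": {"b"},
    "s3": {"c", "d"},
    "s4": {"a", "b", "c"},
}
universe = set().union(*sentences.values())

kept = greedy_set_cover(universe, sentences)
print(kept)  # ['s4', 's3']
```

In this toy input, sentences s1 and s2 are dropped as redundant because s4 and s3 together already convey every label. Unlike a clustering step that keeps one representative per cluster, the cover is complete by construction, which matches the abstract's concern with preserving all information in the document.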