Local text reuse detection

  • Authors:
  • Jangwon Seo;W. Bruce Croft

  • Affiliations:
  • University of Massachusetts, Amherst, Amherst, MA, USA;University of Massachusetts, Amherst, Amherst, MA, USA

  • Venue:
  • Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers because of its importance in Web search. Local text reuse occurs when sentences, facts or passages, rather than whole documents, are reused and modified. Detecting this type of reuse can be the basis of new tools for text analysis. In this paper, we introduce a new approach to detecting local text reuse and compare it to other approaches. This comparison involves a study of the amount and type of reuse that occurs in real documents, including TREC newswire and blog collections.