Candidate document retrieval for web-scale text reuse detection

  • Authors:
  • Matthias Hagen;Benno Stein

  • Affiliations:
  • Faculty of Media, Bauhaus-Universität Weimar, Germany;Faculty of Media, Bauhaus-Universität Weimar, Germany

  • Venue:
  • SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
  • Year:
  • 2011

Quantified Score

Hi-index 0.01

Visualization

Abstract

Given a document d, the task of text reuse detection is to find those passages in d which in identical or paraphrased form also appear in other documents. To solve this problem at web-scale, keywords representing d's topics have to be combined to web queries. The retrieved web documents can then be delivered to a text reuse detection system for an in-depth analysis. We focus on the query formulation problem as the crucial first step in the detection process and present a new query formulation strategy that achieves convincing results: compared to a maximal termset query formulation strategy [10, 14], which is the most sensible non-heuristic baseline, we save on average 70% of the queries in realistic experiments. With respect to the candidate documents' quality, our heuristic retrieves documents that are, on average, more similar to the given document than the results of previously published query formulation strategies [4, 8].