Plagiarism Detection in arXiv

  • Authors:
  • Daria Sorokina;Johannes Gehrke;Simeon Warner;Paul Ginsparg

  • Affiliations:
  • Cornell University, USA;Cornell University, USA;Cornell University, USA;Cornell University, USA

  • Venue:
  • ICDM '06 Proceedings of the Sixth International Conference on Data Mining
  • Year:
  • 2006

Abstract

We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.