Plagiarism Detection in arXiv

Authors:
Daria Sorokina;Johannes Gehrke;Simeon Warner;Paul Ginsparg
Affiliations:
Cornell University, USA;Cornell University, USA;Cornell University, USA;Cornell University, USA
Venue:
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Year:
2006

Citing 0
Cited 7

Removing manually generated boilerplate from electronic texts: experiments with project Gutenberg e-books
CASCON '07 Proceedings of the 2007 conference of the center for advanced studies on Collaborative research

Generating links by mining quotations
Proceedings of the nineteenth ACM conference on Hypertext and hypermedia

How opinions are received by online communities: a case study on amazon.com helpfulness votes
Proceedings of the 18th international conference on World wide web

Efficient privacy-preserving similar document detection
The VLDB Journal — The International Journal on Very Large Data Bases

Detection of simple plagiarism in computer science papers
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics

ATLAS: a probabilistic algorithm for high dimensional similarity search
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Experiments with filtered detection of similar academic papers
AIMSA'12 Proceedings of the 15th international conference on Artificial Intelligence: methodology, systems, and applications

Abstract

We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.