Information retrieval techniques for corpus filtering applied to external plagiarism detection

  • Authors:
  • Daniel Micol; Oscar Ferrández; Rafael Muñoz

  • Affiliations:
  • Research Group on Natural Language Processing and Information Systems, Department of Software and Computing Systems, University of Alicante, Alicante, Spain; Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah; Research Group on Natural Language Processing and Information Systems, Department of Software and Computing Systems, University of Alicante, Alicante, Spain

  • Venue:
  • NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
  • Year:
  • 2011

Abstract

We present a set of approaches for corpus filtering in the context of external document plagiarism detection. Producing filtered sets, and hence limiting the problem's search space, can improve performance and is used today in many real-world applications, such as web search engines. In document plagiarism detection, the database of documents to match the suspicious candidate against is potentially large, so applying filtered-set generation techniques is highly advisable. The approaches that we have implemented include information retrieval methods and a document similarity measure based on a variant of tf-idf. Furthermore, we perform textual comparisons, as well as a semantic similarity analysis, in order to capture higher levels of obfuscation.
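
As an illustration of the general idea of tf-idf-based corpus filtering, the following minimal Python sketch ranks source documents by cosine similarity to a suspicious document and keeps only the top k as the filtered set. The abstract does not specify the tf-idf variant used, so the smoothed idf formula, the whitespace tokenizer, and the names `tokenize`, `build_idf`, `tfidf_vector`, `cosine`, and `filter_corpus` are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # likely also apply stop-word removal and stemming.
    return text.lower().split()


def build_idf(tokenized_docs):
    # Smoothed idf (an assumed variant): log((1 + N) / (1 + df)) + 1.
    n = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0
           for term, count in df.items()}
    # idf assigned to terms that never occur in the corpus.
    default_idf = math.log(1 + n) + 1.0
    return idf, default_idf


def tfidf_vector(tokens, idf, default_idf):
    # Sparse vector: term -> relative term frequency times idf.
    tf = Counter(tokens)
    return {term: (freq / len(tokens)) * idf.get(term, default_idf)
            for term, freq in tf.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_corpus(suspicious, corpus, k):
    # Return the indices of the k corpus documents most similar to the
    # suspicious document, ordered by descending cosine similarity.
    tokenized = [tokenize(doc) for doc in corpus]
    idf, default_idf = build_idf(tokenized)
    query = tfidf_vector(tokenize(suspicious), idf, default_idf)
    scores = [cosine(query, tfidf_vector(tokens, idf, default_idf))
              for tokens in tokenized]
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    return ranked[:k]


corpus = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "a cat sat on a mat quietly",
]
print(filter_corpus("the cat sat on the mat", corpus, k=2))
```

Only the documents in the returned filtered set would then be passed to the more expensive textual and semantic comparisons, which is where the reduction of the search space pays off.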