Information retrieval techniques for corpus filtering applied to external plagiarism detection

Authors:
Daniel Micol;Oscar Ferrández;Rafael Muñoz
Affiliations:
Research Group on Natural Language Processing and Information Systems, Department of Software and Computing Systems, University of Alicante, Alicante, Spain;Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah;Research Group on Natural Language Processing and Information Systems, Department of Software and Computing Systems, University of Alicante, Alicante, Spain
Venue:
NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
Year:
2011

Citing 6
Cited 1

CHECK: a document plagiarism detection system

SAC '97 Proceedings of the 1997 ACM symposium on Applied computing
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Strategies for retrieving plagiarized documents

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval

Introduction to Information Retrieval
Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance

CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
WordNet: similarity - measuring the relatedness of concepts

AAAI'04 Proceedings of the 19th national conference on Artifical intelligence

Automated crime report analysis and classification for e-government and decision support

Proceedings of the 14th Annual International Conference on Digital Government Research

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a set of approaches for corpus filtering in the context of document external plagiarism detection. Producing filtered sets, and hence limiting the problem's search space, can be a performance improvement and is used today in many real-world applications such as web search engines. With regards to document plagiarism detection, the database of documents to match the suspicious candidate against is potentially fairly large, and hence it becomes very recommendable to apply filtered set generation techniques. The approaches that we have implemented include information retrieval methods and a document similarity measure based on a variant of tf-idf. Furthermore, we perform textual comparisons, as well as a semantic similarity analysis in order to capture higher levels of obfuscation.