Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance

Authors:
Alberto Barrón-Cedeño;Paolo Rosso;José-Miguel Benedí
Affiliation:
Department of Information Systems and Computation, Universidad Politécnica de Valencia, Valencia, Spain 46022 (all authors)
Venue:
CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009

Abstract

Automatic plagiarism detection against a reference corpus compares a suspicious text with a set of original documents in order to relate the plagiarised fragments to their potential sources. Publications on this task often assume that the search space (the set of reference documents) is small enough that any search strategy will produce good output in a short time. This is not always the case: reference corpora often contain a large number of original documents, which makes a simple exhaustive search impractical. Before carrying out an exhaustive search, it is therefore necessary to reduce the search space, that is, the set of reference documents to be compared, as much as possible. Our experiments with the METER corpus show that a preliminary search space reduction stage, based on the symmetric Kullback-Leibler distance, reduces search time dramatically. It also improves the Precision and Recall obtained by a search strategy based on the exhaustive comparison of word n-grams.
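The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices in it are assumptions made for illustration only: the whitespace tokenizer, the unigram term distributions, the additive smoothing constant `epsilon`, the per-pair vocabulary, the `keep` cut-off, and the containment score used for the n-gram stage. All function names are hypothetical. The symmetric distance is computed as KL(P||Q) + KL(Q||P), which simplifies to the sum of (P(x) - Q(x)) * log(P(x) / Q(x)) over all terms x.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def unigram_distribution(tokens, vocabulary, epsilon=1e-6):
    """Relative term frequencies over `vocabulary`, with additive smoothing.

    The smoothing constant is an assumption of this sketch; it keeps every
    probability strictly positive so that the logarithms below are defined.
    """
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary) + epsilon * len(vocabulary)
    return {t: (counts[t] + epsilon) / total for t in vocabulary}


def symmetric_kl(p, q):
    """KL(P||Q) + KL(Q||P), written as sum of (P - Q) * log(P / Q)."""
    return sum((p[x] - q[x]) * math.log(p[x] / q[x]) for x in p)


def reduce_search_space(suspicious, references, keep=2):
    """Stage 1: names of the `keep` reference documents closest to `suspicious`."""
    susp_tokens = tokenize(suspicious)
    distances = {}
    for name, text in references.items():
        ref_tokens = tokenize(text)
        # Assumption: both distributions share the union of the pair's terms.
        vocabulary = set(susp_tokens) | set(ref_tokens)
        p = unigram_distribution(susp_tokens, vocabulary)
        q = unigram_distribution(ref_tokens, vocabulary)
        distances[name] = symmetric_kl(p, q)
    return sorted(distances, key=distances.get)[:keep]


def word_ngrams(tokens, n=3):
    """Set of word n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(suspicious, reference, n=3):
    """Stage 2: fraction of the suspicious n-grams also found in the reference."""
    s = word_ngrams(tokenize(suspicious), n)
    r = word_ngrams(tokenize(reference), n)
    return len(s & r) / len(s) if s else 0.0


if __name__ == "__main__":
    corpus = {
        "a": "the cat sat on the mat",
        "b": "stock markets fell sharply on monday",
        "c": "the cat sat on the sofa",
    }
    suspicious_text = "the cat sat on the mat today"
    # Only the documents that survive stage 1 go through the n-gram comparison.
    for name in reduce_search_space(suspicious_text, corpus, keep=2):
        print(name, ngram_containment(suspicious_text, corpus[name]))
```

In this toy run only two of the three reference documents reach the more expensive n-gram comparison. On a large reference corpus the same filter would be what keeps the exhaustive stage tractable, with `keep` trading search time against the risk of discarding a true source.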