Determining and characterizing the reused text for plagiarism detection

Authors:
Fernando Sánchez-Vega;Esaú Villatoro-Tello;Manuel Montes-y-Gómez;Luis Villaseñor-Pineda;Paolo Rosso
Affiliations:
Lab. de Tecnologías del Lenguaje, Coordinación de Ciencias Computacionales, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico;Information Technologies Department, Universidad Autónoma Metropolitana (UAM), Mexico;Lab. de Tecnologías del Lenguaje, Coordinación de Ciencias Computacionales, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico;Lab. de Tecnologías del Lenguaje, Coordinación de Ciencias Computacionales, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico;Natural Language Engineering Lab., ELiRF, Universitat Politècnica de València, Spain
Venue:
Expert Systems with Applications: An International Journal
Year:
2013


Abstract

An important task in plagiarism detection is determining and measuring similar text portions between a given pair of documents. One of the main difficulties of this task lies in the fact that reused text is commonly modified with the aim of covering or camouflaging the plagiarism. Another difficulty is that not all similar text fragments are examples of plagiarism, since thematic coincidences also tend to produce portions of similar text. To tackle these problems, we propose a novel method for detecting likely portions of reused text. This method is able to detect common actions performed by plagiarists, such as word deletion, insertion and transposition, which allows it to obtain plausible portions of reused text. We also propose representing the identified reused text by means of a set of features that denote its degree of plagiarism, relevance and fragmentation. This new representation aims to facilitate the recognition of plagiarism by considering diverse characteristics of the reused text during the classification phase. Experimental results employing a supervised classification strategy showed that the proposed method is able to outperform traditionally used approaches.
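
The general idea of locating shared text portions and then describing them with a few numeric features can be illustrated with a minimal sketch. This is not the method proposed in the paper, whose details are not given in the abstract: it swaps in Python's standard `difflib.SequenceMatcher`, which tolerates word insertions and deletions between matched fragments but does not handle transpositions. The function names, the `min_len` parameter and the three features (`reused_ratio`, `n_fragments`, `mean_fragment_len`) are hypothetical stand-ins chosen for illustration, loosely echoing the notions of degree of reuse and fragmentation.

```python
from difflib import SequenceMatcher


def shared_fragments(suspicious, source, min_len=2):
    """Return word sequences of the suspicious text that also occur in the source.

    Illustrative only: matching is order-preserving, so gaps caused by
    inserted or deleted words are tolerated, but transposed fragments are not.
    """
    a = suspicious.lower().split()
    b = source.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # also drops the zero-length sentinel block
    ]


def describe_reuse(suspicious, source, min_len=2):
    """Summarize the shared fragments with a few hypothetical features."""
    frags = shared_fragments(suspicious, source, min_len)
    n_words = len(suspicious.split())
    reused = sum(len(f) for f in frags)
    return {
        # share of the suspicious text covered by shared fragments
        "reused_ratio": reused / n_words if n_words else 0.0,
        # how many separate pieces the shared text is split into
        "n_fragments": len(frags),
        # average piece length; short pieces suggest thematic coincidence
        "mean_fragment_len": reused / len(frags) if frags else 0.0,
    }


if __name__ == "__main__":
    suspicious = "the quick brown fox jumps over the lazy dog"
    source = "the quick brown fox quietly jumps over the very lazy dog"
    print(describe_reuse(suspicious, source))
```

A feature dictionary of this kind could be fed to any supervised classifier, which is the role the abstract assigns to its own representation; the specific features and thresholds here are assumptions, not results from the paper.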