Towards document plagiarism detection based on the relevance and fragmentation of the reused text

Authors:
Fernando Sánchez-Vega;Luis Villaseñor-Pineda;Manuel Montes-Y-Gómez;Paolo Rosso
Affiliations:
Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico;Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico;Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics, Mexico and Department of Computer and Information Sciences, U ...;Natural Language Engineering Lab, ELiRF, DSIC, Universidad Politécnica de Valencia, Spain
Venue:
MICAI'10 Proceedings of the 9th Mexican international conference on Advances in artificial intelligence: Part I
Year:
2010

Abstract

Traditionally, external plagiarism detection has been carried out by identifying and measuring the similar sections between a given pair of documents, known as the source and suspicious documents. One of the main difficulties of this task lies in the fact that not all similar text sections are examples of plagiarism, since thematic coincidences also tend to produce portions of common text. To address this problem, in this paper we propose to represent the common (possibly reused) text by means of a set of features that denote its relevance and fragmentation. This new representation, used in conjunction with supervised learning algorithms, provides more information for the automatic detection of document plagiarism; in particular, our experimental results show that it clearly outperforms traditional n-gram based approaches in accuracy.
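
The abstract does not list the actual features, so the following is only a minimal sketch of what a relevance-and-fragmentation representation of the common text could look like. The function names (`common_fragments`, `reuse_features`), the `min_words` threshold, and the four features are all assumptions made for illustration, not the authors' definitions; the shared fragments are found with the standard-library `difflib.SequenceMatcher` rather than with the method used in the paper.

```python
from difflib import SequenceMatcher


def common_fragments(suspicious, source, min_words=2):
    """Return word sequences shared by both documents (illustrative only)."""
    susp = suspicious.lower().split()
    src = source.lower().split()
    matcher = SequenceMatcher(None, susp, src, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel block,
    # which the min_words filter discards.
    return [
        susp[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


def reuse_features(suspicious, source, min_words=2):
    """Describe the common text with hypothetical relevance/fragmentation features."""
    frags = common_fragments(suspicious, source, min_words)
    lengths = [len(f) for f in frags]
    total = sum(lengths)
    susp_len = len(suspicious.split())
    return {
        # Relevance (assumed): share of the suspicious document that is common text.
        "coverage": total / susp_len if susp_len else 0.0,
        # Fragmentation (assumed): how the common text is split into pieces.
        "num_fragments": len(frags),
        "mean_fragment_len": total / len(frags) if frags else 0.0,
        "max_fragment_len": max(lengths, default=0),
    }


if __name__ == "__main__":
    suspicious = "the quick brown fox jumps over the lazy dog today"
    source = "a quick brown fox sat while the lazy dog slept"
    print(reuse_features(suspicious, source))
```

A feature dictionary of this kind, computed for each source and suspicious pair, could then be turned into a vector and passed to any supervised classifier trained on pairs labelled as plagiarised or not.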