Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style

Authors:
Gabriel Oberreuter; Juan D. Velásquez
Affiliations:
Web Intelligence Consortium Chile Research Centre, Department of Industrial Engineering, Universidad de Chile, Av. República 701, P.O. Box 8370439, Chile;Web Intelligence Consortium Chile Research Centre, Department of Industrial Engineering, Universidad de Chile, Av. República 701, P.O. Box 8370439, Chile
Venue:
Expert Systems with Applications: An International Journal
Year:
2013

Abstract

Plagiarism detection is of special interest to educational institutions, and with the proliferation of digital documents on the Web, computational systems for this task have become important. Traditional methods for automatic plagiarism detection compute similarity measures on a document-to-document basis, but this is not always possible because the potential source documents are not always available. We apply text mining, exploring the use of words as a linguistic feature for analyzing a document by modeling the writing style present in it. The main goal is to discover deviations in style, looking for segments of the document that could have been written by another person. This can be treated as a classification problem that uses only information from the document itself, where paragraphs with significant deviations in style are treated as outliers. This so-called intrinsic plagiarism detection approach needs no comparison against possible sources, and because our model relies only on the use of words, it is not language specific. We demonstrate that this feature shows promise in this area, achieving reasonable results compared to benchmark models.
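
The idea of flagging paragraphs as style outliers using only word usage can be illustrated with a minimal Python sketch. It is not the authors' model: the scoring formula in `segment_score`, the `threshold` parameter, and all function names are assumptions chosen for illustration. Each paragraph is scored by how much of its vocabulary is shared with the rest of the document, and paragraphs scoring far below the document average are reported.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def segment_score(seg_tokens, doc_counts):
    """Mean normalized gap between document-wide and in-segment word counts.

    Illustrative measure only (an assumption, not the paper's formula):
    values near 1 mean the segment's words are used throughout the document;
    values near 0 mean its words occur almost exclusively in this segment.
    """
    seg_counts = Counter(seg_tokens)
    gaps = [
        (doc_counts[word] - count) / (doc_counts[word] + count)
        for word, count in seg_counts.items()
    ]
    return mean(gaps) if gaps else 0.0


def flag_outliers(paragraphs, threshold=1.5):
    """Return indices of paragraphs whose score lies far below the average.

    `threshold` is a hypothetical cutoff in standard deviations.
    """
    tokens = [tokenize(p) for p in paragraphs]
    doc_counts = Counter(word for seg in tokens for word in seg)
    scores = [segment_score(seg, doc_counts) for seg in tokens]
    if len(scores) < 2:
        return []
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return []  # every paragraph looks the same; nothing to flag
    return [i for i, s in enumerate(scores) if (mu - s) / sigma > threshold]
```

Calling `flag_outliers` on a list of paragraph strings returns the positions of candidate passages. No source documents are consulted, and the tokenizer relies only on Unicode letter classes, which mirrors the claim that a purely word-based model is not tied to one language.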