Word length n-grams for text re-use detection

Authors:
Alberto Barrón-Cedeño;Chiara Basile;Mirko Degli Esposti;Paolo Rosso
Affiliations:
NLEL-ELiRF, Department of Information Systems and Computation, Universidad Politécnica de Valencia, Spain;Dipartimento di Matematica, Università di Bologna, Italy;Dipartimento di Matematica, Università di Bologna, Italy;NLEL-ELiRF, Department of Information Systems and Computation, Universidad Politécnica de Valencia, Spain
Venue:
CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2010

Abstract

The automatic detection of shared content in written documents, which includes text reuse and its unacknowledged form, plagiarism, has become an important problem in Information Retrieval. The task requires an exhaustive comparison of texts to determine how similar they are. Such a comparison is infeasible, however, when the number of documents is too large. We have therefore designed a model that pre-selects closely related documents, so that the exhaustive comparison is performed only on them afterwards. We use a similarity measure based on word-level n-grams, which has proved quite effective in many applications. As this approach is normally impractical for large real-world datasets, we propose a method based on a preliminary word-length encoding of texts, in which each word is replaced by its length. This provides three important advantages: (i) because the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times decrease; and (iii) length n-grams can be represented in a trie, allowing a faster and more flexible comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly reduce the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.