Automatic text categorization in terms of genre and author
Computational Linguistics
Building a large annotated corpus of English: the penn treebank
Computational Linguistics - Special issue on using large corpora: II
Experiments on sentence boundary detection
ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Accurate unlexicalized parsing
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
A survey on tree edit distance and related problems
Theoretical Computer Science
The pq-gram distance between ordered labeled trees
ACM Transactions on Database Systems (TODS)
An evaluation framework for plagiarism detection
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Hi-index | 0.00 |
Intrinsic plagiarism detection deals with the task of finding plagiarized sections of text documents without using a reference corpus. This paper describes a novel approach to this task by processing and analyzing the grammar of a suspicious document. The main idea is to split a text into single sentences and to calculate grammar trees. To find suspicious sentences, these grammar trees are compared in a distance matrix by using the pq-gram-distance, an alternative for the tree edit distance. Finally, significantly different sentences regarding their grammar and with respect to the Gaussian normal distribution are marked as suspicious.