Using syntactic information to identify plagiarism

  • Authors: Özlem Uzuner; Boris Katz; Thade Nahnsen

  • Affiliation (all authors): Computer Science and Artificial Intelligence Laboratory, Cambridge, MA

  • Venue: EdAppsNLP 05, Proceedings of the Second Workshop on Building Educational Applications Using NLP
  • Year: 2005

Abstract

Using keyword overlap to identify plagiarism can produce many false negatives and false positives: substituting synonyms for each other lowers the measured similarity between works, making plagiarism difficult to recognize, while overlap in ambiguous keywords can falsely inflate the similarity of works that in fact differ in content. Plagiarism detection based on verbatim similarity becomes ineffective when works are paraphrased, even in superficial and immaterial ways. Considering linguistic information related to the creative aspects of writing adds a crucial dimension to the evaluation of similarity: documents that share linguistic elements in addition to content are more likely to have been copied from each other. In this paper, we present a set of low-level syntactic structures that capture creative aspects of writing, and we show that combining information about the linguistic similarity of works with similarity measurements based on tfidf-weighted keywords improves recognition of plagiarism over tfidf-weighted keywords alone.
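
To make the keyword baseline concrete, the following is a minimal sketch of tfidf-weighted cosine similarity between documents, written with the Python standard library only. It is not the authors' implementation: the tokenizer, the `tf * log(N / df)` weighting variant, the function names, and the sample sentences are all assumptions chosen for illustration. The `combined_similarity` helper is likewise hypothetical; it only shows one simple way a syntactic similarity score could be mixed with a keyword score, and the abstract does not state how the paper combines them.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace, strip punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(keyword_sim, syntactic_sim, alpha=0.5):
    # Hypothetical linear mix of a keyword score and a syntactic score;
    # alpha is an illustrative weight, not a value from the paper.
    return alpha * keyword_sim + (1 - alpha) * syntactic_sim


if __name__ == "__main__":
    docs = [
        "The boy quickly ate the large apple.",      # original
        "The boy quickly ate the large apple.",      # verbatim copy
        "The lad rapidly consumed the big apple.",   # synonym substitution
        "Stock markets fell sharply on Monday.",     # unrelated
    ]
    vecs = tfidf_vectors(docs)
    for label, i in [("verbatim", 1), ("paraphrase", 2), ("unrelated", 3)]:
        print(label, round(cosine(vecs[0], vecs[i]), 3))
```

In this toy collection the synonym-substituted sentence scores far lower than the verbatim copy even though it conveys the same content, which is the false-negative problem the abstract describes for keyword-only detection.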