Analysis and extraction of sentence-level paraphrase sub-corpus in CS education

  • Authors:
  • Faisal Alvi; El-Sayed M. El-Alfy; Wasfi G. Al-Khatib; Radwan E. Abdel-Aal

  • Affiliations:
  • King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia (all authors)

  • Venue:
  • Proceedings of the 13th annual conference on Information technology education
  • Year:
  • 2012

Abstract

Since the advent of the Internet, plagiarism has become a widespread problem in student submissions. Paraphrasing is one of several types of plagiarism that students use to mask the original source. In this work, we construct a sub-corpus of paraphrased sentences by extracting all lightly and heavily revised sentences from the Corpus of Plagiarized Short Answers, using modified criteria for sentences. We then apply document similarity measures to this sub-corpus and derive some of its notable features. Our findings suggest that the sub-corpus is better suited to testing paraphrase detection techniques than the original corpus, because it provides sentence-level paraphrase samples rather than file-level classification. Additional sentence samples may also be added to the sub-corpus to increase its variety and scale.
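
The abstract does not name the similarity measures applied to the sub-corpus. As a minimal sketch of what a sentence-level comparison could look like, the following assumes two common choices, Jaccard similarity and containment over word n-grams. The function names, the tokenization rule, and the example sentences are hypothetical and are not taken from the paper or the corpus.

```python
import re


def word_ngrams(sentence, n=1):
    """Return the set of word n-grams in a sentence.

    Assumption: tokens are lowercased runs of letters and digits;
    punctuation is discarded.
    """
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def containment(source, suspect):
    """Fraction of the suspect sentence's n-grams that also occur in the source."""
    if not suspect:
        return 0.0
    return len(source & suspect) / len(suspect)


def sentence_similarity(source, suspect, n=1):
    """Compare one source sentence with one possibly paraphrased sentence."""
    s, t = word_ngrams(source, n), word_ngrams(suspect, n)
    return {"jaccard": jaccard(s, t), "containment": containment(s, t)}


if __name__ == "__main__":
    # Hypothetical sentence pair, for illustration only.
    scores = sentence_similarity("The cat sat on the mat.", "The cat sat on a rug.")
    print(scores)
```

Containment is asymmetric, which can be useful when the suspect sentence is much shorter or longer than the source. Larger values of n make both scores stricter, so heavily revised sentences tend to score lower than lightly revised ones.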