Evolving similarity functions for code plagiarism detection

  • Authors: Vic Ciesielski; Nelson Wu; Seyed Tahaghoghi
  • Affiliations: RMIT University, Melbourne, Australia (all authors)
  • Venue: Proceedings of the 10th annual conference on Genetic and evolutionary computation
  • Year: 2008

Abstract

Detecting whether computer program code is a student's original work or has been copied from another student or some other source is a major problem for many universities. Detection methods based on the information retrieval concepts of indexing and similarity matching scale well to large collections of files, but require appropriate similarity functions for good performance. We have used particle swarm optimization and genetic programming to evolve similarity functions that are suited to computer program code. Using a training set of plagiarised and non-plagiarised programs, we first evolved better parameter values for the previously published Okapi BM25 similarity function. We then used genetic programming to evolve completely new similarity functions that do not conform to any predetermined structure. The evolved similarity functions outperformed the human-developed Okapi BM25 function. We also found that a detection system using the evolved functions was more accurate than the best code plagiarism detection system in use today, and that it scales much better to large collections of files. The evolutionary computing techniques have been extremely useful in finding similarity functions that advance the state of the art in code plagiarism detection.
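
To make the idea of tuning Okapi BM25 concrete, here is a minimal Python sketch of a BM25-style similarity function in which the parameters that an optimizer would adjust are exposed as arguments. It is an illustration only: the abstract does not give the exact BM25 variant, the tokenisation of program code, or the parameter ranges the authors used, so the function name `bm25_score`, the default values of `k1`, `b` and `k3`, and the smoothed IDF form are all assumptions taken from common textbook presentations of BM25.

```python
import math
from collections import Counter


def bm25_score(query, doc, corpus, k1=1.2, b=0.75, k3=1000.0):
    """Score `doc` against `query` with a textbook BM25 variant.

    `query` and `doc` are lists of tokens; `corpus` is a list of such lists.
    k1, b and k3 are the free parameters a search method could tune
    (the defaults here are conventional values, not the paper's).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    doc_tf = Counter(doc)
    query_tf = Counter(query)

    # Document-length normalisation: b controls how strongly long
    # documents are penalised, k1 controls term-frequency saturation.
    length_norm = k1 * ((1.0 - b) + b * len(doc) / avg_len)

    score = 0.0
    for term, f_q in query_tf.items():
        f_d = doc_tf.get(term, 0)
        if f_d == 0:
            continue  # terms absent from the document contribute nothing
        n_t = sum(1 for d in corpus if term in d)  # document frequency
        # Smoothed IDF (assumed variant) that stays positive for common terms.
        idf = math.log(1.0 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        doc_part = (k1 + 1.0) * f_d / (length_norm + f_d)
        query_part = (k3 + 1.0) * f_q / (k3 + f_q)
        score += idf * doc_part * query_part
    return score
```

In a setup like the one the abstract describes, each candidate parameter vector `(k1, b, k3)` would be scored by how well the resulting rankings separate known plagiarised pairs from non-plagiarised ones in the training set; that fitness measure is not specified in the abstract and is left out of this sketch.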