Application of Information Retrieval Techniques for Source Code Authorship Attribution

  • Authors:
  • Steven Burrows;Alexandra L. Uitdenbogerd;Andrew Turpin

  • Affiliations:
  • School of Computer Science and Information Technology, RMIT University, Melbourne, Australia 3001 (all authors)

  • Venue:
  • DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
  • Year:
  • 2009

Abstract

Authorship attribution assigns works of contentious authorship to their rightful owners, solving cases of theft, plagiarism and authorship disputes in academia and industry. In this paper we investigate the application of information retrieval techniques to authorship attribution of C source code. In particular, we explore novel methods for converting C code into documents suitable for retrieval systems, experimenting with 1,597 student programming assignments. We investigate several possible program derivations, partition attribution results by original program length to measure effectiveness on modest and lengthy programs separately, and evaluate three different methods for interpreting document rankings as authorship attribution. The best of our methods achieves an average classification accuracy of 76.78% on a one-in-ten classification problem, which is competitive with six existing baselines. The techniques that we present can be the basis of practical software to support source code authorship investigations.
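
The general idea of treating source code as retrievable documents can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the program derivations, the retrieval system, or the three ranking interpretations. The sketch assumes a simple derivation (identifiers collapsed to `ID`, numeric literals to `NUM`, then token trigrams), cosine similarity as the ranking function, and attribution to the author of the single top-ranked document. All names here (`to_document`, `rank`, `attribute`, the keyword list) are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed token pattern: identifiers, numeric literals, and a few C operators.
# Anything else (string quotes, dots, etc.) is simply ignored in this sketch.
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|\d\w*|==|!=|<=|>=|&&|\|\||[-+*/%=<>!&|;{}()\[\],]"
)
# Assumed (incomplete) keyword list; other names are treated as identifiers.
KEYWORDS = {"if", "else", "for", "while", "do", "switch", "case",
            "break", "return", "int", "char", "void"}


def to_document(source, n=3):
    """Convert C source text into a bag of token n-grams."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if tok in KEYWORDS else "ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query_source, corpus):
    """Rank (author, source) pairs by similarity to the query program."""
    query = to_document(query_source)
    scored = [(cosine(query, to_document(src)), author)
              for author, src in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def attribute(query_source, corpus):
    """Attribute the query to the author of the top-ranked document."""
    ranking = rank(query_source, corpus)
    return ranking[0][1] if ranking else None
```

In a one-in-ten setting, `corpus` would hold known samples from ten candidate authors and `attribute` would return one of them; aggregating scores over several samples per author is another plausible interpretation of a ranking.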