Using code metric histograms and genetic algorithms to perform author identification for software forensics

  • Authors: Robert Charles Lange; Spiros Mancoridis

  • Affiliations: Drexel University, Philadelphia, PA; Drexel University, Philadelphia, PA

  • Venue: Proceedings of the 9th annual conference on Genetic and evolutionary computation
  • Year: 2007

Abstract

We have developed a technique to characterize software developers' styles using a set of source code metrics. This style fingerprint can be used to identify the likely author of a piece of code from a pool of candidates. Author identification has applications in criminal justice, corporate litigation, and plagiarism detection. Furthermore, we can identify candidate developers who share similar styles, making our technique useful for software maintenance as well. Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics, and the time required to search the problem space exhaustively prevented us from adding more. Using a genetic algorithm to perform the search, we were able to find good metric combinations in hours as opposed to weeks. The genetic algorithm has enabled us to begin adding new metrics to our catalog. This paper documents the results of our experiments in author identification for software forensics and outlines future directions of research to improve the utility of our method.
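
The two ideas in the abstract, comparing metric histograms and searching for a good metric combination with a genetic algorithm, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the distance measure (sum of absolute differences between normalized histograms), the nearest-profile classifier, the fitness function (fraction of labeled samples identified correctly), and the GA settings (tournament selection, one-point crossover, bit-flip mutation, elitism). If a combination is any non-empty subset of 18 metrics, there are 2^18 - 1 = 262,143 candidates, which is why a guided search is attractive.

```python
import random

def normalize(hist):
    """Scale a histogram (bin -> count) so its values sum to 1."""
    total = sum(hist.values())
    return {b: c / total for b, c in hist.items()} if total else {}

def hist_distance(h1, h2):
    """Sum of absolute differences between two normalized histograms
    (an assumed distance measure, chosen for simplicity)."""
    p, q = normalize(h1), normalize(h2)
    return sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in set(p) | set(q))

def identify(sample, profiles, metrics, mask):
    """Return the author whose profile is closest to `sample`,
    using only the metrics whose bit in `mask` is 1."""
    def total(author):
        return sum(hist_distance(sample[m], profiles[author][m])
                   for m, on in zip(metrics, mask) if on)
    return min(profiles, key=total)

def fitness(mask, metrics, profiles, labeled):
    """Fraction of (author, sample) pairs identified correctly (assumed fitness)."""
    if not any(mask):
        return 0.0  # an empty metric set cannot discriminate
    hits = sum(identify(s, profiles, metrics, mask) == a for a, s in labeled)
    return hits / len(labeled)

def evolve(metrics, profiles, labeled, pop_size=20, generations=30,
           p_mut=0.05, seed=0):
    """Search metric subsets (bitmasks) with a simple generational GA."""
    rng = random.Random(seed)
    n = len(metrics)
    score = lambda m: fitness(m, metrics, profiles, labeled)
    pop = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):
        nxt = [best]  # elitism: carry the best mask forward unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 2), key=score)  # tournament of size 2
            b = max(rng.sample(pop, 2), key=score)
            cut = rng.randrange(1, n) if n > 1 else 0
            child = a[:cut] + b[cut:]               # one-point crossover
            child = tuple(bit ^ (rng.random() < p_mut) for bit in child)
            nxt.append(child)
        pop = nxt
        best = max(pop, key=score)
    return best, score(best)
```

In this sketch, `profiles` maps each candidate author to one histogram per metric, and `labeled` is a list of `(author, sample)` pairs used to score a mask. A call such as `evolve(metrics, profiles, labeled)` returns the best bitmask found and its score. The metric names and data are supplied by the caller and are hypothetical.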