Using code metric histograms and genetic algorithms to perform author identification for software forensics

Authors:
Robert Charles Lange;Spiros Mancoridis
Affiliations:
Drexel University, Philadelphia, PA;Drexel University, Philadelphia, PA
Venue:
Proceedings of the 9th annual conference on Genetic and evolutionary computation
Year:
2007

Abstract

We have developed a technique to characterize software developers' styles using a set of source code metrics. This style fingerprint can be used to identify the likely author of a piece of code from a pool of candidates. Author identification has applications in criminal justice, corporate litigation, and plagiarism detection. Furthermore, we can identify candidate developers who share similar styles, making our technique useful for software maintenance as well. Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics, and the time required to search the problem space exhaustively prevented us from adding more metrics. Using a genetic algorithm to perform the search, we were able to find good metric combinations in hours as opposed to weeks. The genetic algorithm has enabled us to begin adding new metrics to our catalog of available metrics. This paper documents the results of our experiments in author identification for software forensics and outlines future directions of research to improve the utility of our method.
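
The abstract does not give the distance measure, the metric definitions, or the genetic algorithm's parameters, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that each metric yields a normalized histogram per author, that histograms are compared by the sum of absolute bin differences, and that a simple genetic algorithm searches over bitmasks that switch metrics on or off. The metric names, the toy data, and every function name here are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

# Hypothetical metric names; the paper's 18 metrics are not listed in the abstract.
METRICS = ("line_length", "indent_width", "comments_per_function")

def histogram(values):
    """Turn raw metric observations into a relative-frequency histogram."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def distance(h1, h2):
    """Assumed measure: sum of absolute differences between histogram bins."""
    return sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in set(h1) | set(h2))

def identify(sample, profiles, mask):
    """Return the candidate whose profile is closest on the selected metrics."""
    selected = [m for m, on in zip(METRICS, mask) if on]
    def total(author):
        return sum(distance(sample[m], profiles[author][m]) for m in selected)
    return min(sorted(profiles), key=total)

def fitness(mask, profiles, labeled):
    """Number of labeled samples attributed to their true author."""
    if not any(mask):
        return 0  # an empty metric set cannot discriminate between authors
    return sum(identify(s, profiles, mask) == author for author, s in labeled)

def evolve(profiles, labeled, n_metrics, pop_size=8, generations=10, seed=0):
    """Tiny genetic algorithm over metric bitmasks (assumed operators)."""
    rng = random.Random(seed)
    score = lambda m: fitness(m, profiles, labeled)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_metrics))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = pop[:2]  # elitism: carry the two best masks forward
        while len(nxt) < pop_size:
            # Binary tournament selection for each parent.
            a, b = (max(rng.sample(pop, 2), key=score) for _ in range(2))
            cut = rng.randrange(1, n_metrics) if n_metrics > 1 else 0
            child = a[:cut] + b[cut:]  # one-point crossover
            # Bit-flip mutation with probability 0.1 per metric.
            child = tuple(bit ^ (rng.random() < 0.1) for bit in child)
            nxt.append(child)
        pop = nxt
    return max(pop, key=score)

def make(lengths, indents, comments):
    """Bundle raw observations into one histogram per metric."""
    return dict(zip(METRICS, map(histogram, (lengths, indents, comments))))

# Toy data, invented purely for illustration.
PROFILES = {
    "alice": make([40, 40, 80], [4, 4, 4], [0, 1]),
    "bob": make([40, 80, 80], [2, 2, 4], [0, 1]),
}
LABELED = [
    ("alice", make([40, 40, 40, 80], [4, 4], [0, 1])),
    ("bob", make([80, 80, 40, 80], [2, 2], [0, 1])),
]

best = evolve(PROFILES, LABELED, len(METRICS))
print(best, fitness(best, PROFILES, LABELED))
```

With only three toy metrics the search space has eight masks, so exhaustive search would be trivial here. The point of the sketch is the shape of the computation: the number of candidate subsets doubles with every metric added, which is the growth the abstract cites as the reason for replacing exhaustive search with a genetic algorithm.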