We have developed a technique to characterize software developers' styles using a set of source code metrics. This style fingerprint can be used to identify the likely author of a piece of code from a pool of candidates. Author identification has applications in criminal justice, corporate litigation, and plagiarism detection. Furthermore, we can identify candidate developers who share similar styles, making our technique useful for software maintenance as well. Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics, and the time required to search the problem space exhaustively prevented us from adding more metrics. Using a genetic algorithm to perform the search, we were able to find good metric combinations in hours as opposed to weeks. The genetic algorithm has enabled us to begin adding new metrics to our catalog of available metrics. This paper documents the results of our experiments in author identification for software forensics and outlines future directions of research to improve the utility of our method.
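The two ideas described above, comparing per-metric histograms and searching for a good metric combination with a genetic algorithm, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration rather than the method of the paper: the sum-of-absolute-differences distance, the nearest-profile decision rule, the accuracy-based fitness, and the genetic algorithm settings (`pop_size`, `generations`, `mutation_rate`, one-point crossover, elitism) are all hypothetical choices, as are the function names.

```python
import random
from collections import Counter


def histogram(values):
    """Turn raw metric observations into a relative-frequency histogram."""
    counts = Counter(values)
    total = sum(counts.values())
    return {bin_: count / total for bin_, count in counts.items()}


def histogram_distance(h1, h2):
    """Sum of absolute bin differences (assumed measure): 0 = identical, 2 = disjoint."""
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in set(h1) | set(h2))


def identify(sample, profiles, selected):
    """Return the candidate whose profile is closest to the sample on the selected metrics."""
    def total_distance(author):
        return sum(histogram_distance(sample[m], profiles[author][m]) for m in selected)
    return min(profiles, key=total_distance)


def fitness(mask, metrics, profiles, labeled_samples):
    """Fraction of labeled samples attributed correctly using the masked-in metrics."""
    selected = [m for m, on in zip(metrics, mask) if on]
    if not selected:
        return 0.0  # an empty metric set cannot identify anyone
    hits = sum(identify(s, profiles, selected) == author for author, s in labeled_samples)
    return hits / len(labeled_samples)


def genetic_search(metrics, profiles, labeled_samples,
                   pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Search metric subsets (bit masks) with a simple generational GA.

    Assumes at least two metrics and pop_size >= 4.
    """
    rng = random.Random(seed)
    n = len(metrics)

    def score(mask):
        return fitness(mask, metrics, profiles, labeled_samples)

    population = [tuple(rng.random() < 0.5 for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0]]              # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple(bit != (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = children
    best = max(population, key=score)
    return best, score(best)
```

The motivation for a heuristic search is the size of the space: with 18 metrics there are 2^18 = 262,144 possible subsets, and each added metric doubles that count, whereas this sketch evaluates only `pop_size` masks per generation. In the sketch, `profiles` maps each candidate author to a dictionary of per-metric histograms, and `labeled_samples` is a list of `(author, sample_histograms)` pairs used only to score a metric combination.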