Effective identification of source code authors using byte-level information

Authors:
Georgia Frantzeskou;Efstathios Stamatatos;Stefanos Gritzalis;Sokratis Katsikas
Affiliations:
University of the Aegean, Karlovasi, Greece;University of the Aegean, Karlovasi, Greece;University of the Aegean, Karlovasi, Greece;University of the Aegean, Karlovasi, Greece
Venue:
Proceedings of the 28th international conference on Software engineering
Year:
2006

Citing (8):
The internet worm program: an analysis. ACM SIGCOMM Computer Communication Review
Beyond preliminary analysis of the WANK and OILZ worms: a case study of malicious code. Computers and Security
Software forensics: can we track code to its authors? Computers and Security
Software forensics: old methods for a new science. SEEP '96 Proceedings of the 1996 International Conference on Software Engineering: Education and Practice (SE:EP '96)
IDENTIFIED: A Dictionary-Based System for Extracting Source Code Metrics for Software Forensics. SEEP '98 Proceedings of the 1998 International Conference on Software Engineering: Education & Practice
Augmenting Naive Bayes Classifiers with Statistical Language Models. Information Retrieval
Automatic text categorization in terms of genre and author. Computational Linguistics
Extraction of Java program fingerprints for software authorship identification. Journal of Systems and Software

Cited (3):
Examining the significance of high-level programming features in source code author classification. Journal of Systems and Software
A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology
Application of Information Retrieval Techniques for Source Code Authorship Attribution. DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications

Abstract

Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, and tracing the source of code left in a system after a cyber attack. We present a new approach, called the SCAP (Source Code Author Profiles) approach, which uses byte-level n-gram profiles to represent a source code author's style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited number of very short programs per programmer are available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
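
The idea of a byte-level n-gram author profile can be illustrated with a minimal Python sketch. The abstract does not give the profile construction or the similarity measure, so the details here are assumptions made for illustration: a profile is taken to be the set of the most frequent byte n-grams in an author's known programs, and candidates are ranked by how many n-grams their profile shares with the profile of the disputed program. The function names and the defaults `n=3` and `size=1500` are hypothetical.

```python
from collections import Counter


def byte_ngram_profile(source: bytes, n: int = 3, size: int = 1500) -> set:
    """Return the `size` most frequent byte-level n-grams of `source`.

    Working on raw bytes means no tokenizer or parser is needed, so the
    same code applies to Java, C++, or any other language.
    """
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def most_likely_author(unknown: bytes, samples: dict, n: int = 3,
                       size: int = 1500) -> str:
    """Pick the candidate whose profile overlaps most with `unknown`.

    `samples` maps an author name to a list of that author's programs
    (as bytes). Similarity is the size of the profile intersection,
    which is an assumption of this sketch.
    """
    target = byte_ngram_profile(unknown, n, size)
    scores = {}
    for author, programs in samples.items():
        # Build one profile per author from all known programs.
        profile = byte_ngram_profile(b"\n".join(programs), n, size)
        scores[author] = len(target & profile)
    return max(scores, key=scores.get)
```

To use it, read each training program with `open(path, "rb").read()`, group the results by author in a dictionary, and pass the disputed program's bytes as `unknown`. Because whitespace and punctuation are part of the byte stream, layout habits contribute to the profile even when comments are absent.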