Examining the significance of high-level programming features in source code author classification

Authors:
Georgia Frantzeskou;Stephen MacDonell;Efstathios Stamatatos;Stefanos Gritzalis
Affiliations:
Department of Information and Communication Systems Engineering, University of the Aegean, Samos 83200, Greece;School of Computing and Mathematical Sciences, Auckland University of Technology, Private Bag 92006, Auckland 1020, New Zealand;Department of Information and Communication Systems Engineering, University of the Aegean, Samos 83200, Greece;Department of Information and Communication Systems Engineering, University of the Aegean, Samos 83200, Greece
Venue:
Journal of Systems and Software
Year:
2008

Abstract

The use of Source Code Author Profiles (SCAP) represents a new, highly accurate approach to source code authorship identification that is, unlike previous methods, language independent. While accuracy is clearly a crucial requirement of any author identification method, in cases of litigation regarding authorship, plagiarism, and so on, there is also a need to know why it is claimed that a piece of code is written by a particular author. What is it about that piece of code that suggests a particular author? What features in the code make one author more likely than another? In this study, we describe a means of identifying the high-level features that contribute to source code authorship identification, using the SCAP method as a tool. A variety of features are considered for Java and Common Lisp, and the importance of each feature in determining authorship is measured through a sequence of experiments in which we remove one feature at a time. The results show that, for these programs, comments, layout features, and package-related naming influence classification accuracy, whereas user-defined naming, an obvious programmer-related feature, does not appear to influence accuracy. A comparison is also made between the relative feature contributions in programs written in the two languages.
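The remove-one-feature-at-a-time experiment described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how SCAP works internally, so the sketch assumes a simplified profile: the set of an author's most frequent byte n-grams, with a sample assigned to the author whose profile overlaps most with its own. All function names, the n-gram length, the profile size, and the naive comment stripper are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import Counter


def profile(text, n=3, size=500):
    """Assumed SCAP-like profile: the `size` most frequent byte n-grams."""
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def classify(sample, training, n=3, size=500):
    """Return the author whose profile shares the most n-grams with the sample.

    `training` maps author name -> concatenated source code of that author.
    Ties are resolved in favour of the first author in `training`.
    """
    sample_profile = profile(sample, n, size)
    return max(
        training,
        key=lambda author: len(sample_profile & profile(training[author], n, size)),
    )


def strip_line_comments(code):
    """Hypothetical feature remover: drop Java-style `//` line comments.

    Deliberately naive: it would also cut a `//` that appears inside a string.
    """
    return "\n".join(line.split("//", 1)[0].rstrip() for line in code.splitlines())


def accuracy(training, test, transform=lambda code: code):
    """Fraction of (author, code) test pairs classified correctly.

    `transform` is applied to both training and test code, so passing a
    feature remover measures accuracy with that feature absent.
    """
    transformed = {author: transform(code) for author, code in training.items()}
    hits = sum(classify(transform(code), transformed) == author for author, code in test)
    return hits / len(test)


def feature_importance(training, test, remover):
    """Accuracy lost when one feature is removed (positive means it mattered)."""
    return accuracy(training, test) - accuracy(training, test, remover)
```

Calling `feature_importance(training, test, strip_line_comments)` on a labelled corpus gives the accuracy drop attributable to line comments under these assumptions; the same pattern extends to other removers, such as one that normalises layout or renames user-defined identifiers.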