Source code author identification with unsupervised feature learning

Authors:
Upul Bandara;Gamini Wijayarathna
Affiliations:
Department of Industrial Management, University of Kelaniya, Sri Lanka;Department of Industrial Management, University of Kelaniya, Sri Lanka
Venue:
Pattern Recognition Letters
Year:
2013

Abstract

Automatic identification of source code authors has applications in fields such as source code plagiarism detection and lawsuit prosecution. This paper presents a new source code author identification system based on an unsupervised feature learning technique. As a method of extracting features from high-dimensional data, unsupervised feature learning has achieved great success in fields such as character recognition and image classification. However, to our knowledge, it has not been applied to source code author identification. We therefore investigated an unsupervised feature learning technique, the sparse auto-encoder, as a method of extracting features from source code files. We evaluated the system on several datasets, and the results show that its performance is very close to that of state-of-the-art techniques in the source code author identification field.
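
To make the idea of a sparse auto-encoder as a feature extractor concrete, the following is a minimal NumPy sketch. It is not the system described in the paper. The input features (`layout_features`, a normalized count of a few layout and punctuation characters), the hyper-parameter values, and all function names are assumptions chosen for illustration. The network has one sigmoid hidden layer and is trained by plain batch gradient descent on reconstruction error plus a weight-decay term and a KL-divergence penalty that pushes the mean hidden activation toward a small target `rho`.

```python
import numpy as np

# Hypothetical input features: relative frequency of a few layout characters.
# The abstract does not specify the features used; this is an assumption.
SYMBOLS = list("{}();=_ \t\n")

def layout_features(source):
    counts = np.array([source.count(s) for s in SYMBOLS], dtype=float)
    return counts / max(len(source), 1)  # every value lies in [0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_autoencoder(X, n_hidden=8, rho=0.05, beta=3.0, lam=1e-4,
                             lr=0.5, n_iter=500, seed=0):
    """Train on X of shape (n_samples, n_features); return (params, losses)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W1 = rng.normal(0.0, 0.1, (n, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n))
    b2 = np.zeros(n)
    losses = []
    for _ in range(n_iter):
        # Forward pass: encode, then reconstruct the input.
        A1 = sigmoid(X @ W1 + b1)
        A2 = sigmoid(A1 @ W2 + b2)
        # Mean activation of each hidden unit, clipped for numerical safety.
        rho_hat = np.clip(A1.mean(axis=0), 1e-6, 1.0 - 1e-6)
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
        loss = (0.5 * np.sum((A2 - X) ** 2) / m
                + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
                + beta * kl)
        losses.append(loss)
        # Backward pass: the sparsity term adds a per-unit offset to the
        # error signal that reaches the hidden layer.
        d2 = (A2 - X) * A2 * (1.0 - A2)
        sparse_grad = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
        d1 = (d2 @ W2.T + sparse_grad) * A1 * (1.0 - A1)
        W2 -= lr * (A1.T @ d2 / m + lam * W2)
        b2 -= lr * d2.mean(axis=0)
        W1 -= lr * (X.T @ d1 / m + lam * W1)
        b1 -= lr * d1.mean(axis=0)
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}, losses

def encode(X, params):
    """Hidden-layer activations, used as the learned feature vector."""
    return sigmoid(X @ params["W1"] + params["b1"])

if __name__ == "__main__":
    snippets = [
        "int main() {\n\treturn 0;\n}\n",
        "def f(x):\n    y_1 = x\n    return y_1\n",
        "a=b;c=d;",
    ]
    X = np.stack([layout_features(s) for s in snippets])
    params, losses = train_sparse_autoencoder(X, n_hidden=4, n_iter=200)
    print(encode(X, params).shape)
```

In a full pipeline, the output of `encode` would typically be passed to a supervised classifier trained on author labels. That stage is omitted here because the abstract does not describe it.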