Authorship attribution assigns works of contested authorship to their rightful owners, resolving cases of theft, plagiarism, and authorship disputes in academia and industry. In this paper we investigate the application of information retrieval techniques to authorship attribution of C source code. In particular, we explore novel methods for converting C code into documents suitable for retrieval systems, experimenting with 1,597 student programming assignments. We investigate several possible program derivations, partition attribution results by original program length to measure effectiveness on short and long programs separately, and evaluate three different methods for interpreting document rankings as authorship attributions. The best of our methods achieves an average classification accuracy of 76.78% on a one-in-ten classification problem, which is competitive with six existing baselines. The techniques that we present can form the basis of practical software to support source code authorship investigations.
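As a rough illustration of the general idea only, the sketch below turns C source into a "document" of token n-grams and attributes a query program to the author of the top-ranked document. Everything specific here is an assumption made for illustration and is not taken from the paper: the keyword subset `C_KEYWORDS`, the tokenizer pattern `TOKEN_RE`, the choice of trigrams, cosine similarity as the ranking function, and the top-one attribution rule in the hypothetical `attribute` helper.

```python
import math
import re
from collections import Counter

# Assumed, illustrative subset of C keywords; identifiers outside it are dropped.
C_KEYWORDS = {
    "if", "else", "for", "while", "do", "return", "switch", "case", "break",
    "int", "char", "void", "float", "double", "struct", "static", "const",
}

# Assumed tokenizer: words, a few two-character operators, then single symbols.
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|==|!=|<=|>=|&&|\|\||\+\+|--|[-+*/%=<>!&|;{}()\[\],]"
)


def code_to_terms(source, n=3):
    """Convert C source into a list of token n-gram 'terms' (assumed scheme)."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            if tok in C_KEYWORDS:
                tokens.append(tok)  # keep keywords
            # other identifiers are dropped so naming does not dominate
        else:
            tokens.append(tok)  # keep operators and punctuation
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(query_source, corpus):
    """Rank (author, source) pairs against the query; return top author and ranking."""
    query = Counter(code_to_terms(query_source))
    ranked = sorted(
        ((cosine(query, Counter(code_to_terms(src))), author) for author, src in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return ranked[0][1], ranked


if __name__ == "__main__":
    corpus = [
        ("alice", "int main() { int i; for (i = 0; i < 10; i++) { total += i; } return 0; }"),
        ("bob", "void f(char *s) { while (*s) { if (*s == 'a') count++; s++; } }"),
    ]
    query = "int main() { int k; for (k = 0; k < 5; k++) { sum += k; } return 0; }"
    author, ranking = attribute(query, corpus)
    print(author, ranking)
```

Taking the single top-ranked document is only one way to read a ranking as an attribution; an alternative would be to aggregate scores per author across several ranked documents.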