Towards the detection of cross-language source code reuse

Authors:
Enrique Flores;Alberto Barrón-Cedeño;Paolo Rosso;Lidia Moreno
Affiliations:
Dpto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia;Dpto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia;Dpto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia;Dpto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia
Venue:
NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
Year:
2011

Abstract

The Internet has made huge amounts of information available, including source code. Source code repositories and, more generally, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a nearly uninvestigated problem. Our preliminary experiments, based on character n-gram comparison, show that considering different sections of the code (i.e., comments, code, reserved words, etc.) leads to different results. For the three programming languages considered (C++, Java, and Python), the best result is obtained when comments are discarded and the entire remaining source code is considered.
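
To make the idea of character n-gram comparison concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the n-gram size, the preprocessing, or the similarity measure, so the choices here (n = 3, lowercasing, whitespace removal, raw n-gram counts, cosine similarity) are assumptions made for illustration. The helper names strip_comments, char_ngrams, and cosine are hypothetical, and the comment stripping is deliberately crude: it ignores string literals.

```python
import math
import re
from collections import Counter


def strip_comments(code: str, hash_comments: bool = False) -> str:
    """Crudely remove comments; does not handle comment markers inside strings.

    hash_comments=True also removes '#' line comments (Python style). Leave it
    False for C++ so that '#include' lines are preserved.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # /* block */
    code = re.sub(r"//[^\n]*", " ", code)                    # // line
    if hash_comments:
        code = re.sub(r"#[^\n]*", " ", code)                 # # line
    return code


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams after lowercasing and dropping whitespace.

    Both normalisation steps and n = 3 are assumptions of this sketch.
    """
    text = re.sub(r"\s+", "", text.lower())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    java_src = """
    // sum the numbers from 0 to n-1
    int total = 0;
    for (int i = 0; i < n; i++) { total += i; }
    """
    python_src = """
    # sum the numbers from 0 to n-1
    total = 0
    for i in range(n):
        total += i
    """
    # Compare the code with comments discarded, as in the best-performing
    # setting reported in the abstract.
    java_vec = char_ngrams(strip_comments(java_src))
    python_vec = char_ngrams(strip_comments(python_src, hash_comments=True))
    print(f"similarity: {cosine(java_vec, python_vec):.3f}")
```

Keeping comment removal separate from n-gram extraction makes it easy to compare variants (comments only, code only, everything), which is the kind of contrast the abstract reports.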