Towards the detection of cross-language source code reuse

  • Authors:
  • Enrique Flores;Alberto Barrón-Cedeño;Paolo Rosso;Lidia Moreno

  • Affiliations:
  • Dpto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia;Dpto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia;Dpto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia;Dpto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia

  • Venue:
  • NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
  • Year:
  • 2011

Abstract

The Internet has made huge amounts of information available, including source code. Source code repositories and, more generally, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a nearly uninvestigated problem. Our preliminary experiments, based on the comparison of character n-grams, show that considering different sections of the code (i.e., comments, code, reserved words, etc.) leads to different results. For three programming languages (C++, Java, and Python), the best result is obtained when comments are discarded and the entire source code is considered.
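
To make the idea of character n-gram comparison concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not state the n-gram length, the weighting scheme, or the similarity measure, so the choices here (trigrams, raw term frequencies, cosine similarity) are assumptions, and the helper names (`strip_comments`, `char_ngrams`, `cosine_similarity`, `reuse_score`) are hypothetical. Comment removal uses simple regular expressions that ignore comment markers inside string literals; a real system would likely need a proper lexer.

```python
import math
import re
from collections import Counter

# Assumed, simplified comment patterns (hypothetical; they do not handle
# comment markers that appear inside string literals).
COMMENT_PATTERNS = {
    "c_like": re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL),  # C++ and Java
    "python": re.compile(r"#[^\n]*"),
}


def strip_comments(source, style):
    """Remove comments from source code written in the given comment style."""
    return COMMENT_PATTERNS[style].sub("", source)


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and removing whitespace."""
    normalized = re.sub(r"\s+", "", text.lower())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def reuse_score(src_a, style_a, src_b, style_b, n=3):
    """Similarity of two programs with comments discarded (assumed setup)."""
    profile_a = char_ngrams(strip_comments(src_a, style_a), n)
    profile_b = char_ngrams(strip_comments(src_b, style_b), n)
    return cosine_similarity(profile_a, profile_b)


# Illustrative inputs, invented for this sketch.
java_src = """
// sum the first n integers
int total = 0;
for (int i = 0; i < n; i++) { total += i; }
"""
python_src = """
# sum the first n integers
total = 0
for i in range(n):
    total += i
"""
unrelated_src = "print('hello world')"

if __name__ == "__main__":
    print(reuse_score(java_src, "c_like", python_src, "python"))
    print(reuse_score(java_src, "c_like", unrelated_src, "python"))
```

In this toy example the Java and Python fragments share identifiers and operators, so their score should be higher than the score against the unrelated fragment. Whether a given score indicates reuse would depend on a threshold or ranking step that the abstract does not describe.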