Detecting source code similarity using code abstraction

  • Authors:
  • Seongsoo Park;Seungcheol Ko;Jungsik Choi;Hwansoo Han;Seong-Je Cho;Jongmoo Choi

  • Affiliations:
  • Sungkyunkwan University, Suwon, Korea;Sungkyunkwan University, Suwon, Korea;Sungkyunkwan University, Suwon, Korea;Sungkyunkwan University, Suwon, Korea;Dankook University, Yongin, Korea;Dankook University, Yongin, Korea

  • Venue:
  • Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
  • Year:
  • 2013

Abstract

Various approaches have been proposed for measuring program similarity effectively. Commercial and freeware tools that measure program similarity by comparing source code are already available. These tools are quite useful for small to medium-scale software products, but are of limited use for large-scale software products. In addition, they may report less credible similarity measures for source code that has been obfuscated by malicious users or generated by automatic program template generation tools. To handle large-scale software, more drastic measures are needed. In this paper, we propose an automatic abstraction method that summarizes source code by eliminating the large portion of it that is less relevant to similarity comparison. With this abstraction, our similarity comparison method provides measures that are more robust against obfuscation and automatic code generation. We evaluate our abstraction method by running the abstracted code through MOSS, a web-based similarity detection tool. According to our experiments with multiple versions of Apache HTTP server, PuTTY SSH client, and Lighttpd server, our abstraction method reports quite reliable results with abstracted source code that is only 23-35% of the size of the original source code. Because the execution time of pattern matching is linearly proportional to the length of the source code, our method can reduce the execution time in proportion to the reduction in source code size.
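The abstract does not spell out which parts of the source are kept or discarded, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes, hypothetically, that comments and literal contents are irrelevant to similarity and that control-flow keywords and called function names are relevant. The names `abstract_source`, `size_ratio`, and `SAMPLE` are invented for this example.

```python
import re

# Assumption: C-like input. These rules are hypothetical, not taken from the paper.
STRING = r'"(?:\\.|[^"\\])*"'
CHAR = r"'(?:\\.|[^'\\])*'"
COMMENT = r"/\*.*?\*/|//[^\n]*"
# Literals are listed first so that "//" inside a string is not read as a comment.
STRIP_RE = re.compile(f"{STRING}|{CHAR}|{COMMENT}", re.DOTALL)

# An identifier, optionally followed by an opening parenthesis.
TOKEN_RE = re.compile(r"\b([A-Za-z_]\w*)\b(\s*\()?")

CONTROL = {"if", "else", "for", "while", "do", "switch", "case",
           "return", "break", "continue", "goto"}


def _blank(match: re.Match) -> str:
    """Empty out literal contents and replace comments with a space."""
    text = match.group(0)
    if text[0] == '"':
        return '""'
    if text[0] == "'":
        return "''"
    return " "


def abstract_source(code: str) -> str:
    """Keep only control-flow keywords and call-like names, line by line."""
    stripped = STRIP_RE.sub(_blank, code)
    kept = []
    for line in stripped.splitlines():
        tokens = []
        for name, paren in TOKEN_RE.findall(line):
            if name in CONTROL:
                tokens.append(name)
            elif paren:  # name followed by "(": a call or a definition
                tokens.append(name + "()")
        if tokens:
            kept.append(" ".join(tokens))
    return "\n".join(kept)


def size_ratio(original: str, abstracted: str) -> float:
    """Abstracted size as a fraction of the original size, in characters."""
    return len(abstracted) / len(original)


SAMPLE = r"""/* greet */
int main(void) {
    int n = 3; // count
    for (int i = 0; i < n; i++) {
        printf("hi // there\n");
    }
    return 0;
}
"""
```

Calling `abstract_source(SAMPLE)` leaves four short lines (`main()`, `for`, `printf()`, `return`), which could then be submitted to a comparison tool in place of the original file. Under the abstract's linear-time assumption, the value returned by `size_ratio` would also approximate the fraction of the original matching time that remains.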