On the Effectiveness of Simhash for Detecting Near-Miss Clones in Large Scale Software Systems

Authors:
Md. Sharif Uddin; Chanchal K. Roy; Kevin A. Schneider; Abram Hindle
Venue:
WCRE '11 Proceedings of the 2011 18th Working Conference on Reverse Engineering
Year:
2011

Abstract

Clone detection techniques essentially cluster textually, syntactically, and/or semantically similar code fragments within or across software systems. For large datasets, similarity identification is costly in both time and memory, especially when detecting near-miss clones, where lines may be modified, added, and/or deleted in the copied fragments. The capability and effectiveness of a clone detection tool depend mostly on the code similarity measurement technique it uses. A variety of similarity measurement approaches have been used for clone detection, including fingerprint-based approaches, which have had varying degrees of success notwithstanding some limitations. In this paper, we investigate the effectiveness of simhash, a state-of-the-art fingerprint-based data similarity measurement technique, for detecting both exact and near-miss clones in large-scale software systems. Our experimental data show that simhash is indeed effective in identifying various types of clones in a software system despite wide variations in experimental circumstances. The approach is also suitable as a core capability for building other tools, such as tools for incremental clone detection, code searching, and clone management.
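
To make the fingerprint idea concrete, the following is a minimal sketch of a simhash computation over a code fragment, with Hamming distance as the similarity measure. It is an illustration only, not the implementation evaluated in the paper: the regex tokenizer, the use of MD5 as the per-token hash, the fixed 64-bit fingerprint width, and the function names `tokens`, `simhash`, and `hamming` are all assumptions made for this example.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width for this sketch


def tokens(code):
    """Split a code fragment into identifiers, numbers, and single symbols.

    Assumption: a crude regex tokenizer; whitespace is discarded.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def simhash(features):
    """Return a 64-bit simhash fingerprint for an iterable of string features."""
    weights = [0] * BITS
    for feature in features:
        # Assumption: the first 8 bytes of MD5 serve as a 64-bit token hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:  # a bit is set when the positive votes win
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = simhash(tokens("int total = price + tax;"))
modified = simhash(tokens("int total = price + tax + fee;"))
print(hamming(original, modified))
```

Under this scheme, two fragments would be treated as near-miss clone candidates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold value is a tuning parameter that this sketch does not fix.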