An evaluation of code similarity identification for the grow-and-prune model

  • Authors:
  • Thilo Mende; Rainer Koschke; Felix Beckwermert

  • Affiliation (all authors):
  • University of Bremen, Fachbereich 3, Postfach 33 04 40, 28334 Bremen, Germany

  • Venue:
  • Journal of Software Maintenance and Evolution: Research and Practice - Special Issue on the 12th Conference on Software Maintenance and Reengineering (CSMR 2008)
  • Year:
  • 2009

Abstract

When new functionality is required that is similar to existing functionality, developers often copy the code that implements the existing functionality and adjust the copy to the new requirements. The result of this copying is code growth. If developers face maintenance problems because changes must be made repeatedly to the original and all its copies, they may decide to merge the original and its copies again; that is, they prune the code. This approach was named the grow-and-prune model by Faust and Verhoef. This paper describes tool support for the grow-and-prune model in the evolution of software by identifying similar functions that may be merged. These functions are identified in two steps. First, token-based clone detection is used to detect pairs of functions sharing code. Second, the Levenshtein distance (LD) measures the textual similarity of these function pairs. Sufficient similarity at the function level is then lifted to the architectural level. The approach is evaluated in a case study of the Linux kernel. We give examples of instances of the grow-and-prune model in Linux. Then, we evaluate our technique quantitatively by measuring recall and precision with respect to an oracle. To obtain the oracle, we asked nine developers to decide whether they believe certain functions are similar and should be merged. The evaluation shows that the recall and precision of our technique are about 75%. Calculating the LD on token values rather than on characters is superior: the two metrics are strongly correlated, but the token-based calculation reduces runtime by a factor of 4.6. Clone detection is an effective filter that reduces the number of computations of the relatively expensive LD. Copyright © 2009 John Wiley & Sons, Ltd.
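
The sketch below is a minimal, hypothetical Python illustration (not the authors' tool) of the idea behind the second step: computing the Levenshtein distance over token sequences instead of characters and deriving a normalized similarity from it. The tokenized function bodies and the normalization are assumptions for illustration only; the paper itself filters candidate pairs with clone detection before applying the LD.

    def levenshtein(a, b):
        """Edit distance between two sequences (works for strings or token lists)."""
        m, n = len(a), len(b)
        prev = list(range(n + 1))          # previous row of the DP matrix
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # deletion
                              curr[j - 1] + 1,     # insertion
                              prev[j - 1] + cost)  # substitution
            prev = curr
        return prev[n]

    def similarity(a, b):
        """Normalized similarity in [0, 1]; 1.0 means identical sequences."""
        if not a and not b:
            return 1.0
        return 1.0 - levenshtein(a, b) / max(len(a), len(b))

    # Token-level comparison: each lexical token is one symbol, so the DP matrix
    # is much smaller than for character-level LD on the same function bodies.
    f1 = ["int", "id", "(", "id", ")", "{", "return", "id", "*", "num", ";", "}"]
    f2 = ["int", "id", "(", "id", ")", "{", "return", "id", "+", "num", ";", "}"]
    print(similarity(f1, f2))                        # token-based
    print(similarity("return x*2;", "return x+2;"))  # character-based, for contrast

Because the number of token symbols in a function is much smaller than the number of characters, the quadratic dynamic-programming computation shrinks accordingly, which is consistent with the runtime reduction reported in the abstract.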