Constructing universal version history

  • Authors: Hung-Fu Chang; Audris Mockus
  • Affiliations: University of Southern California, Los Angeles, CA; Avaya Labs Research, Basking Ridge, NJ
  • Venue: Proceedings of the 2006 International Workshop on Mining Software Repositories
  • Year: 2006

Abstract

Developers often copy the code of parts of a product, or of an entire product, to start a new product or a new release. To understand the software change history and to determine code authorship, we propose constructing a universal version history from multiple version control repositories. To that end, we create two practical code copy detection methods that operate at the level of the source code file: the prefix-postfix algorithm and the prefix algorithm. The full pathname of a file and its version history are used to construct the universal version history of that file by linking together the change histories of files that had the same code at any point in the past. Both algorithms assume that developers often duplicate files by copying entire directories. Once the copying is identified, we propose an algorithm that links version histories from multiple repositories into a universal version history. The results show that about 41.32% of source files (in a repository involving more than 6M versions of around 2M files) were duplicated among Avaya's source code repositories for more than ten different projects. The prefix-postfix algorithm is more suitable than the prefix algorithm because its error rates are reasonable when validated against known copying behavior.
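
The abstract does not spell out either detection algorithm, so the following is only a minimal sketch of the general linking idea, not the authors' prefix-postfix or prefix algorithm. It assumes that each file version is available as a hypothetical `(repo, path, content)` record, treats the file name as the pathname's postfix, and links two files when they share a file name and had identical content in at least one version. The function names `content_key` and `link_histories` are illustrative and do not come from the paper.

```python
import hashlib
from collections import defaultdict
from posixpath import basename


def content_key(text):
    """Hash file content so identical versions compare equal cheaply."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def link_histories(versions):
    """Group files whose histories should be linked.

    versions: iterable of (repo, path, content) tuples, one per file version.
    Assumption (not from the paper): two files are linked when they have the
    same file name and identical content in at least one version.
    Returns a list of sets of (repo, path) pairs with more than one member.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (file name, content hash) -> first file with that version
    for repo, path, content in versions:
        file_id = (repo, path)
        find(file_id)  # register the file even if it never matches anything
        key = (basename(path), content_key(content))
        if key in first_seen:
            union(first_seen[key], file_id)
        else:
            first_seen[key] = file_id

    groups = defaultdict(set)
    for file_id in list(parent):
        groups[find(file_id)].add(file_id)
    return [group for group in groups.values() if len(group) > 1]
```

With this sketch, a file `prod1/src/util.c` in one repository and `prod2/src/util.c` in another are placed in the same group as soon as any one of their versions is identical, even if both files diverged afterwards. A real implementation would also need the directory-level evidence the abstract mentions, since matching single files by name and content alone can link unrelated trivial files.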