Developers often copy the code of parts of a product, or of entire products, to start a new product or a new release. To understand software change history and to determine code authorship, we propose constructing a universal version history from multiple version control repositories. To that end we create two practical code copy detection methods that operate at the level of the source code file: the prefix-postfix algorithm and the prefix algorithm. The full pathname of a file and its version history are used to construct the universal version history of that file by linking together the change histories of files that had the same code at any point in the past. Both algorithms assume that developers often duplicate files by copying entire directories. Once copying is identified, we propose an algorithm that links version histories from multiple repositories to construct the universal version history. The results show that about 41.32% of source files (in a repository involving more than 6M versions of around 2M files) were duplicated among Avaya's source code repositories for more than ten different projects. The prefix-postfix algorithm is more suitable than the prefix algorithm because its error rates were reasonable when validated against known copying behaviors.
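
The linking idea described above (joining the change histories of files that had the same code at any point in the past) can be illustrated with a minimal sketch. This is not the prefix-postfix or prefix algorithm, whose details are not given here; it only shows one plausible way to group file histories by shared content. The function names `content_key` and `link_histories`, the use of SHA-1 as a content fingerprint, and the input layout (a dictionary from a `(repository, path)` pair to a list of version contents) are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def content_key(text):
    """Fingerprint one file version (assumption: exact-content match via SHA-1)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def link_histories(histories):
    """Group files whose histories share at least one identical version.

    histories: dict mapping a hypothetical (repo, path) tuple to the list of
    contents of that file's versions. Returns a sorted list of sorted groups.
    """
    # Union-find over files: each file starts in its own group.
    parent = {f: f for f in histories}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Remember the first file seen with each fingerprint; any later file
    # with the same fingerprint is merged into that file's group.
    first_seen = {}
    for f, versions in histories.items():
        for text in versions:
            key = content_key(text)
            if key in first_seen:
                union(first_seen[key], f)
            else:
                first_seen[key] = f

    groups = defaultdict(set)
    for f in histories:
        groups[find(f)].add(f)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical example: util.c was copied from repoA to repoB at version "v2".
example = {
    ("repoA", "src/util.c"): ["v1", "v2"],
    ("repoB", "lib/util.c"): ["v2", "v3"],
    ("repoC", "main.c"): ["x"],
}
linked = link_histories(example)
```

In this sketch the two `util.c` histories end up in one group because they share the content `"v2"`, while `main.c` stays on its own. Comparing every version of every file would be expensive at the scale reported above, which is presumably why the described methods rely on pathnames and directory-level copying to narrow the candidates.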