Detecting code duplication in large code bases, or even across project boundaries, is difficult because of the sheer volume of data involved. Large-scale clone detection also raises new challenges beyond asking for the provenance of a single clone fragment, such as assessing the prevalence of code clones across the entire code base and tracking their evolution. We propose a set of lightweight techniques that may scale up to very large amounts of source code in the presence of multiple versions. The common idea behind these techniques is to use bad hashing to get a quick answer. We report on a case study, the Squeaksource ecosystem, which features thousands of software projects, with more than 40 million versions of methods, across more than seven years of evolution. We provide estimates for the prevalence of type-1, type-2, and type-3 clones in Squeaksource.
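The abstract does not spell out the hashing schemes, so the following is only a minimal sketch of one plausible reading of "bad hashing": a hash computed over a deliberately lossy normalization of a method body, so that near-identical fragments collide and can be grouped in a single pass. The function names (`type1_key`, `type2_key`, `clone_groups`), the toy tokenizer, and the keyword set are assumptions made for illustration; they are not taken from the paper, and comments and string literals are not handled.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical toy keyword set; a real tool would use the target language's grammar.
KEYWORDS = {"if", "else", "for", "while", "return"}

# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def _digest(tokens):
    """Hash a token sequence into a fixed-size key."""
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()


def type1_key(source):
    """Key that ignores layout only, so whitespace-variant copies collide."""
    return _digest(TOKEN_RE.findall(source))


def type2_key(source):
    """Coarser key: every identifier becomes ID and every number becomes N,
    so copies that differ only in names or literals collide."""
    normalized = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isdigit():
            normalized.append("N")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            normalized.append("ID")
        else:
            normalized.append(tok)
    return _digest(normalized)


def clone_groups(methods, key_fn):
    """Bucket method names by key and return the buckets with more than one member."""
    buckets = defaultdict(list)
    for name, source in methods.items():
        buckets[key_fn(source)].append(name)
    return [sorted(names) for names in buckets.values() if len(names) > 1]
```

Because every identifier maps to the same placeholder, `type2_key` also groups fragments that were not renamed consistently. That loss of precision is the price of a single linear pass over the corpus, which is why counts obtained this way are better read as estimates than as exact clone counts.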