On how often code is cloned across repositories

Authors:
Niko Schwarz; Mircea Lungu; Romain Robbes
Affiliations:
University of Bern, Switzerland; University of Bern, Switzerland; University of Chile, Chile
Venue:
Proceedings of the 34th International Conference on Software Engineering
Year:
2012

Abstract

Detecting code duplication in large code bases, or even across project boundaries, is problematic due to the massive amount of data involved. Large-scale clone detection also opens new challenges beyond asking for the provenance of a single clone fragment, such as assessing the prevalence of code clones on the entire code base, and their evolution. We propose a set of lightweight techniques that may scale up to very large amounts of source code in the presence of multiple versions. The common idea behind these techniques is to use bad hashing to get a quick answer. We report on a case study, the Squeaksource ecosystem, which features thousands of software projects, with more than 40 million versions of methods, across more than seven years of evolution. We provide estimates for the prevalence of type-1, type-2, and type-3 clones in Squeaksource.
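
The abstract does not say which hash functions or normalizations the techniques use, so the following is only a minimal sketch of the general idea of hash-based detection of type-1 clones (copies that differ only in layout). It assumes whitespace normalization followed by a SHA-1 digest, and that methods are supplied as a mapping from a qualified name to source text. The function names `normalize`, `fingerprint`, and `group_clones`, and the naming scheme of the keys, are hypothetical and not taken from the paper.

```python
import hashlib
from collections import defaultdict


def normalize(source: str) -> str:
    """Collapse every run of whitespace into a single space (assumed normalization)."""
    return " ".join(source.split())


def fingerprint(source: str) -> str:
    """Hash the normalized source; equal digests mark type-1 clone candidates."""
    return hashlib.sha1(normalize(source).encode("utf-8")).hexdigest()


def group_clones(methods: dict[str, str]) -> list[list[str]]:
    """Bucket method names by fingerprint and keep buckets with more than one member."""
    buckets = defaultdict(list)
    for name, source in methods.items():
        buckets[fingerprint(source)].append(name)
    return [sorted(names) for names in buckets.values() if len(names) > 1]


# Hypothetical input: two methods that differ only in whitespace, plus one unrelated method.
methods = {
    "ProjectA>>Point>>sum": "sum\n    ^ x + y",
    "ProjectB>>Vec>>sum": "sum   ^ x + y",
    "ProjectB>>Vec>>diff": "diff ^ x - y",
}
clone_groups = group_clones(methods)
```

Grouping by digest needs a single pass over all method versions and a hash table, which is what makes this family of approaches attractive at the scale of an entire ecosystem. Detecting type-2 and type-3 clones would need stronger normalization or similarity-preserving hashes, which this sketch does not attempt.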