File cloning in open source Java projects: The good, the bad, and the ugly

  • Authors:
  • Joel Ossher;Hitesh Sajnani;Cristina Lopes

  • Affiliations:
  • Donald Bren School of Information and Computer Sciences, University of California, Irvine, 92697-3425, USA;Donald Bren School of Information and Computer Sciences, University of California, Irvine, 92697-3425, USA;Donald Bren School of Information and Computer Sciences, University of California, Irvine, 92697-3425, USA

  • Venue:
  • ICSM '11 Proceedings of the 2011 27th IEEE International Conference on Software Maintenance
  • Year:
  • 2011

Abstract

We present a study of the extent to which developers copy entire files or sets of files into their applications with little or no modification. Our aim is to determine the prevalence of such activity within open source Java development, and to identify the circumstances under which files are reused in this manner. To accomplish this aim, we developed a novel method of file-level code clone detection that is scalable to millions of files. We applied our method to the Sourcerer Repository, which contains over 13,000 Java projects aggregated from multiple open source repositories. Our method detected that in excess of 10% of files are clones, and that over 15% of all projects contain at least one cloned file. In addition to computing these raw numbers, we manually examined a large number of the reported clones. We found the most commonly cloned files to be Java extension classes and popular third-party libraries, both large and small. We also discovered a number of projects that occur in multiple online repositories, have been forked, or were divided into multiple subprojects.
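The abstract does not describe how the file-level detection method works, so the following is only a minimal sketch of one common baseline, not the authors' technique: grouping files whose contents hash to the same digest. It finds only byte-identical copies (the "no modification" case) and would miss files copied with small edits. The class name `FileCloneSketch`, its methods, and the sample paths are hypothetical and chosen purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: exact-duplicate detection by content hashing.
// This is an assumed baseline, not the method used in the paper.
public class FileCloneSketch {

    /** Returns the hex-encoded SHA-256 digest of a file's contents. */
    static String fingerprint(String contents) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(contents.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** Groups paths by content fingerprint and keeps groups with more than one file. */
    static List<List<String>> cloneGroups(Map<String, String> contentsByPath) {
        Map<String, List<String>> byHash = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : contentsByPath.entrySet()) {
            byHash.computeIfAbsent(fingerprint(entry.getValue()), k -> new ArrayList<>())
                  .add(entry.getKey());
        }
        List<List<String>> groups = new ArrayList<>();
        for (List<String> paths : byHash.values()) {
            if (paths.size() > 1) {
                groups.add(paths);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample data: two projects carrying an identical Util.java.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("projectA/src/Util.java", "class Util {}");
        files.put("projectB/lib/Util.java", "class Util {}");
        files.put("projectB/src/Main.java", "class Main {}");
        System.out.println(cloneGroups(files));
    }
}
```

Because each file is hashed once and then grouped by digest, the work grows roughly linearly with the number of files, which is one reason hashing is a natural starting point at repository scale. Catching files copied with little rather than no modification would need a more tolerant comparison.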