A Framework for Studying Clones In Large Software Systems

Authors: Zhen Ming Jiang; Ahmed E. Hassan
Affiliations: University of Victoria, Canada; University of Victoria, Canada
Venue: SCAM '07: Proceedings of the Seventh IEEE International Working Conference on Source Code Analysis and Manipulation
Year: 2007



Abstract

Clones are code segments that have been created by copying and pasting from other code segments. Clones occur often in large software systems: it has been reported that 5% to 50% of the source code of a large software system is cloned. A major challenge when studying code cloning in large software systems is handling the large number of clone candidates produced by leading-edge clone detection tools. For example, the CCFinder clone detection tool produces over 7 million pairs of clone candidates for the Linux kernel (which consists of over 4 million lines of code). Moreover, the output of clone detection tools grows rapidly as a software system evolves. Researchers and developers need tools to help them study this large volume of clone data in order to better understand the clone phenomena in large systems. In this paper, we propose a data mining framework to help researchers cope with the large amount of data produced by clone detection tools. We propose techniques to reduce, abstract, and highlight the most interesting data produced by clone detection tools. Our framework also introduces a visualization tool that allows users to query and explore clone data at various abstraction levels. We demonstrate our framework in a case study of the clone phenomena in the Linux kernel.
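
As a rough illustration of what abstracting clone data to a coarser level could look like, the sketch below rolls file-level clone pairs up into counts per pair of top-level directories. It is a minimal example under stated assumptions, not the framework described in the paper: the input format (a list of file-path pairs), the function names `subsystem` and `abstract_clone_pairs`, and the sample paths are all hypothetical, and real clone detector output such as CCFinder's would need its own parser.

```python
from collections import Counter
from pathlib import PurePosixPath


def subsystem(path, depth=1):
    """Return the first `depth` directory components of a file path.

    Hypothetical helper: treats leading directories as the "subsystem".
    """
    dirs = PurePosixPath(path).parts[:-1]  # drop the file name
    return "/".join(dirs[:depth]) if dirs else "."


def abstract_clone_pairs(pairs, depth=1):
    """Count clone pairs per unordered pair of subsystems.

    `pairs` is an iterable of (file_a, file_b) tuples; this input shape is
    an assumption made for illustration only.
    """
    counts = Counter()
    for file_a, file_b in pairs:
        # Sort so that (A, B) and (B, A) fall into the same bucket.
        key = tuple(sorted((subsystem(file_a, depth), subsystem(file_b, depth))))
        counts[key] += 1
    return counts


# Made-up sample data; these are not results from the paper.
pairs = [
    ("drivers/net/a.c", "drivers/scsi/b.c"),
    ("drivers/net/c.c", "fs/ext3/d.c"),
    ("fs/ext3/e.c", "drivers/usb/f.c"),
    ("kernel/sched.c", "kernel/fork.c"),
]
summary = abstract_clone_pairs(pairs)
for (left, right), n in summary.most_common():
    print(f"{left} <-> {right}: {n}")
```

Raising `depth` gives a finer-grained view (for example `drivers/net` versus `drivers/scsi`), which loosely mirrors the idea of exploring clone data at several abstraction levels.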