Efficient token based clone detection with flexible tokenization

Authors:
Hamid Abdul Basit;Simon J. Puglisi;William F. Smyth;Andrew Turpin;Stan Jarzabek
Affiliations:
Lahore University of Management Sciences, Lahore, Pakistan;Curtin University of Technology;McMaster University;RMIT University;National University of Singapore, Singapore
Venue:
The 6th Joint Meeting on European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering: companion papers
Year:
2007

Abstract

Code clones are similar code fragments that occur at multiple locations in a software system. Detecting them provides useful information for maintenance, reengineering, program understanding, and reuse. Several techniques have been proposed to detect code clones. They differ in the code representation used for clone analysis, which ranges from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but it is computationally expensive because of the large number of tokens that must be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token-based clone detection and implemented them in our tool, Repeated Tokens Finder (RTF). Instead of a suffix tree for string matching, we use the more memory-efficient suffix array. RTF incorporates a suffix-array-based, linear-time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Initial analysis and experiments show that our clone detection is simple and scalable, and that it outperforms previous well-known tools.
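
The following is a minimal sketch of the general idea, not RTF's actual implementation, which the abstract does not describe in detail. It assumes a hypothetical regex tokenizer with a switch that abstracts identifiers and numbers, builds a suffix array over the token sequence, and reports adjacent suffixes whose longest common prefix (LCP) reaches a minimum length. The names `tokenize`, `suffix_array`, `lcp_array`, and `repeated_runs`, the keyword set, and the token classes `ID` and `NUM` are all illustrative assumptions. For brevity the suffix array is built by plain sorting, which is not linear time; only the LCP step (the Kasai et al. method) runs in linear time.

```python
import re

# Hypothetical lexer: identifiers/keywords, integer literals, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int"}  # assumed, tiny subset


def tokenize(source, abstract_identifiers=True):
    """Split source text into tokens; optionally map names and numbers to classes."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme)
        elif abstract_identifiers and (lexeme[0].isalpha() or lexeme[0] == "_"):
            tokens.append("ID")
        elif abstract_identifiers and lexeme.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens


def suffix_array(seq):
    """Start positions of all suffixes in sorted order (simple sort, not linear time)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def lcp_array(seq, sa):
    """lcp[r] = common prefix length of the suffixes at sa[r - 1] and sa[r] (Kasai et al.)."""
    n = len(seq)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and seq[i + h] == seq[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_runs(tokens, min_len):
    """Pairs (pos_a, pos_b, length) of adjacent suffixes sharing at least min_len tokens."""
    sa = suffix_array(tokens)
    lcp = lcp_array(tokens, sa)
    return [(sa[r - 1], sa[r], lcp[r]) for r in range(1, len(sa)) if lcp[r] >= min_len]


if __name__ == "__main__":
    code = "x = a + 1; y = b + 2;"
    for pos_a, pos_b, length in repeated_runs(tokenize(code), min_len=6):
        print(f"tokens {pos_a}..{pos_a + length - 1} repeat at {pos_b}..{pos_b + length - 1}")
```

With identifier abstraction enabled, the two statements in the sample become the same token sequence and are reported as a repeat; with `abstract_identifiers=False` they differ in their names and literals and no repeat of that length is found. This is one way a customizable tokenizer can control how strict or permissive the matching is.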