Tuning research tools for scalability and performance: The NiCad experience

Authors:
James R. Cordy;Chanchal K. Roy
Affiliations:
School of Computing, Queen's University, Kingston, Ontario, Canada;Department of Computer Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
Venue:
Science of Computer Programming
Year:
2014

Abstract

Clone detection is a research technique for analyzing software systems for similarities, with applications in software understanding, maintenance, evolution, license enforcement and many other areas. The NiCad near-miss clone detection method has been shown to yield highly accurate results in both precision and recall. However, its naive two-step method, which first parses the code to identify and normalize fragments and then compares the fragments line by line using longest common subsequence (LCS), has proven difficult to bring to the efficiency and scalability required for large-scale research applications. Rather than presenting the NiCad tool itself in detail, this paper focuses on our experience in migrating NiCad from an initial rapid prototype to a practical, scalable research tool. The process has increased overall performance by a factor of up to 40 and clone detection speed by a factor of over 400, while reducing memory and processor requirements to fit on a standard laptop. We apply a sequence of four different kinds of performance optimizations and analyze the effect of each optimization in detail. We believe that the lessons of our experience in migrating NiCad from research prototype to production performance may be beneficial to others who face a similar problem.
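
The line-based LCS comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not NiCad's implementation: the function names, the dissimilarity measure (share of lines in the longer fragment that fall outside the LCS), and the 0.3 threshold are assumptions chosen for illustration, and the fragments are assumed to be already normalized into lists of text lines.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two lists of lines.

    Classic dynamic programming, keeping only the previous row,
    so memory is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    for line_a in a:
        curr = [0]
        for j, line_b in enumerate(b, start=1):
            if line_a == line_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dissimilarity(a: List[str], b: List[str]) -> float:
    """Fraction of the longer fragment's lines not covered by the LCS.

    Hypothetical measure for this sketch: 0.0 means identical line
    sequences, 1.0 means no lines in common.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def is_near_miss_clone(a: List[str], b: List[str], threshold: float = 0.3) -> bool:
    """Treat two fragments as clones if they differ by at most `threshold`.

    The default of 0.3 is an assumed value, not one taken from the paper.
    """
    return dissimilarity(a, b) <= threshold


if __name__ == "__main__":
    frag_a = ["int f(int x) {", "x = x + 1;", "return x;", "}"]
    frag_b = ["int f(int x) {", "x = x + 2;", "return x;", "}"]
    print(lcs_length(frag_a, frag_b))
    print(is_near_miss_clone(frag_a, frag_b))
```

Comparing every pair of fragments this way costs time proportional to the product of their lengths per pair, which suggests why a naive version of this step is hard to scale to large systems.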