Finding Similarities in Source Code Through Factorization

Authors:
Michel Chilowicz;Étienne Duris;Gilles Roussel
Affiliations:
Université Paris-Est, Laboratoire d'Informatique de l'Institut Gaspard-Monge, UMR CNRS 8049, 5 Bd Descartes, 77454 Marne-la-Vallée Cedex 2, France;Université Paris-Est, Laboratoire d'Informatique de l'Institut Gaspard-Monge, UMR CNRS 8049, 5 Bd Descartes, 77454 Marne-la-Vallée Cedex 2, France;Université Paris-Est, Laboratoire d'Informatique de l'Institut Gaspard-Monge, UMR CNRS 8049, 5 Bd Descartes, 77454 Marne-la-Vallée Cedex 2, France
Venue:
Electronic Notes in Theoretical Computer Science (ENTCS)
Year:
2009

Abstract

The wide availability of a huge number of documents on the Web makes plagiarism both attractive and easy. Plagiarism affects every kind of document, from natural-language texts to more structured information such as programs. To cope with this problem, many tools and algorithms have been proposed to find similarities. In this paper we present a new algorithm designed to detect similarities in source code. Contrary to existing methods, this algorithm relies on the notion of function and focuses on obfuscation by inlining and outlining of functions. The method is also effective against insertions, deletions and permutations of instruction blocks. It is based on code factorization and uses adapted pattern-matching algorithms and structures such as suffix arrays.
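
To give a rough idea of how a suffix array can expose shared code, the following minimal sketch finds the longest run of tokens common to two functions. It is an illustration only and not the algorithm of the paper: the token lists, the separator `SEP`, and the helper names `suffix_array`, `common_prefix_len` and `longest_shared_run` are all assumptions introduced here, and the suffix array is built naively by sorting suffixes rather than with an efficient construction.

```python
# Illustrative sketch only; not the factorization algorithm of the paper.
# Functions are assumed to be already tokenized into lists of strings.

SEP = "\x00"  # hypothetical separator, assumed never to occur as a token


def suffix_array(tokens):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def common_prefix_len(a, b):
    """Number of leading tokens shared by sequences a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_shared_run(tokens_a, tokens_b):
    """Longest contiguous token run that occurs in both sequences."""
    combined = list(tokens_a) + [SEP] + list(tokens_b)
    boundary = len(tokens_a)  # positions below this index belong to A
    sa = suffix_array(combined)
    best = []
    # Suffixes sharing a long prefix end up adjacent in the suffix array,
    # so it is enough to compare neighbours that come from different sides.
    for i, j in zip(sa, sa[1:]):
        if (i < boundary) == (j < boundary):
            continue
        n = common_prefix_len(combined[i:], combined[j:])
        if n > len(best):
            best = combined[i:i + n]
    return best


if __name__ == "__main__":
    f = ["int", "id", "=", "id", "+", "id", ";", "return", "id", ";"]
    g = ["id", "=", "id", "+", "id", ";", "id", "(", ")", ";"]
    print(longest_shared_run(f, g))
```

A shared run found this way is a candidate for being factored out into a common function, which is the intuition behind treating inlined or outlined code as the same material; the naive construction above is only practical for small inputs.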