Detecting code clones in binary executables

Authors:
Andreas Sæbjørnsen; Jeremiah Willcock; Thomas Panas; Daniel Quinlan; Zhendong Su
Affiliations:
University of California, Davis, Davis, CA, USA; Indiana University, Bloomington, IN, USA; Lawrence Livermore National Laboratory, Livermore, CA, USA; Lawrence Livermore National Laboratory, Livermore, CA, USA; University of California, Davis, Davis, CA, USA
Venue:
Proceedings of the eighteenth international symposium on Software testing and analysis
Year:
2009

Abstract

Large software projects contain significant code duplication, mainly due to copying and pasting code. Many techniques have been developed to identify duplicated code to enable applications such as refactoring, detecting bugs, and protecting intellectual property. Because source code is often unavailable, especially for third-party software, finding duplicated code in binaries becomes particularly important. However, existing techniques operate primarily on source code, and no effective tool exists for binaries. In this paper, we describe the first practical clone detection algorithm for binary executables. Our algorithm extends an existing tree similarity framework based on clustering of characteristic vectors of labeled trees with novel techniques to normalize assembly instructions and to accurately and compactly model their structural information. We have implemented our technique and evaluated it on Windows XP system binaries totaling over 50 million assembly instructions. Results show that it is both scalable and precise: it analyzed Windows XP system binaries in a few hours and produced few false positives. We believe our technique is a practical, enabling technology for many applications dealing with binary code.
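
The core idea in the abstract (normalize assembly instructions, summarize code regions as characteristic vectors, then group similar vectors) can be illustrated with a minimal sketch. This is not the authors' implementation. The register set, the three operand categories (REG, MEM, VAL), the fixed-size sliding window, and the function names are all assumptions made for illustration, and exact matching of vectors stands in for the clustering step the paper describes.

```python
from collections import Counter, defaultdict

# Hypothetical register set, chosen only for this illustration.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def normalize(instruction):
    """Map 'mnemonic op1, op2' to (mnemonic, operand kinds).

    Concrete registers, addresses, and constants are replaced by the
    assumed categories REG, MEM, and VAL so that renamed registers or
    changed constants do not hide a clone.
    """
    mnemonic, _, rest = instruction.strip().partition(" ")
    kinds = []
    for operand in filter(None, (o.strip() for o in rest.split(","))):
        if operand.startswith("["):
            kinds.append("MEM")
        elif operand.lower() in REGISTERS:
            kinds.append("REG")
        else:
            kinds.append("VAL")
    return mnemonic.lower(), tuple(kinds)


def characteristic_vector(window):
    """Count mnemonics and operand kinds in a window of instructions."""
    counts = Counter()
    for instruction in window:
        mnemonic, kinds = normalize(instruction)
        counts[mnemonic] += 1
        counts.update(kinds)
    # A frozenset of (feature, count) pairs is hashable and order-free.
    return frozenset(counts.items())


def find_clone_candidates(instructions, window_size=3):
    """Group window start offsets whose vectors are identical.

    Exact equality is a simplification; it replaces the approximate
    clustering of vectors mentioned in the abstract.
    """
    groups = defaultdict(list)
    for start in range(len(instructions) - window_size + 1):
        vector = characteristic_vector(instructions[start:start + window_size])
        groups[vector].append(start)
    return [starts for starts in groups.values() if len(starts) > 1]


sample = [
    "mov eax, [ebp+8]",
    "add eax, 4",
    "push eax",
    "call 0x401000",
    "mov ecx, [ebp+12]",
    "add ecx, 16",
    "push ecx",
]
print(find_clone_candidates(sample))  # prints [[0, 4]]
```

In this toy input, the windows starting at offsets 0 and 4 differ in registers and constants but normalize to the same vector, so they are reported as a clone pair. Because the vector only counts features, instruction order within a window is ignored; a real detector would need a richer structural model and approximate matching to scale and to limit false positives.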