Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance
SIAM Journal on Computing
Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
CCFinder: a multilinguistic token-based code clone detection system for large scale source code
IEEE Transactions on Software Engineering
Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
DMS®: Program Transformations for Practical Scalable Software Evolution
Proceedings of the 26th International Conference on Software Engineering
ISSTA '04 Proceedings of the 2004 ACM SIGSOFT international symposium on Software testing and analysis
Clone Detection in Source Code by Frequent Itemset Techniques
SCAM '04 Proceedings of the Source Code Analysis and Manipulation, Fourth IEEE International Workshop
Semantics-Aware Malware Detection
SP '05 Proceedings of the 2005 IEEE Symposium on Security and Privacy
Detecting higher-level similarity patterns in programs
Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering
Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing)
Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing)
DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones
ICSE '07 Proceedings of the 29th international conference on Software Engineering
CP-Miner: a tool for finding copy-paste and related bugs in operating system code
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Static analysis of executables to detect malicious patterns
SSYM'03 Proceedings of the 12th conference on USENIX Security Symposium - Volume 12
Panorama: capturing system-wide information flow for malware detection and analysis
Proceedings of the 14th ACM conference on Computer and communications security
Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Scalable detection of semantic clones
Proceedings of the 30th international conference on Software engineering
Detecting self-mutating malware using control-flow graph matching
DIMVA'06 Proceedings of the Third international conference on Detection of Intrusions and Malware & Vulnerability Assessment
Polymorphic worm detection using structural information of executables
RAID'05 Proceedings of the 8th international conference on Recent Advances in Intrusion Detection
Finding software license violations through binary code clone detection
Proceedings of the 8th Working Conference on Mining Software Repositories
Value-based program characterization and its application to software plagiarism detection
Proceedings of the 33rd International Conference on Software Engineering
Recovering the toolchain provenance of binary code
Proceedings of the 2011 International Symposium on Software Testing and Analysis
Supervised learning for provenance-similarity of binaries
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
A first step towards algorithm plagiarism detection
Proceedings of the 2012 International Symposium on Software Testing and Analysis
Detecting encryption functions via process emulation and IL-based program analysis
ICICS'12 Proceedings of the 14th international conference on Information and Communications Security
Rendezvous: a search engine for binary code
Proceedings of the 10th Working Conference on Mining Software Repositories
Towards automatic software lineage inference
SEC'13 Proceedings of the 22nd USENIX conference on Security
Hi-index | 0.00 |
Large software projects contain significant code duplication, mainly due to copying and pasting code. Many techniques have been developed to identify duplicated code to enable applications such as refactoring, detecting bugs, and protecting intellectual property. Because source code is often unavailable, especially for third-party software, finding duplicated code in binaries becomes particularly important. However, existing techniques operate primarily on source code, and no effective tool exists for binaries. In this paper, we describe the first practical clone detection algorithm for binary executables. Our algorithm extends an existing tree similarity framework based on clustering of characteristic vectors of labeled trees with novel techniques to normalize assembly instructions and to accurately and compactly model their structural information. We have implemented our technique and evaluated it on Windows XP system binaries totaling over 50 million assembly instructions. Results show that it is both scalable and precise: it analyzed Windows XP system binaries in a few hours and produced few false positives. We believe our technique is a practical, enabling technology for many applications dealing with binary code.