Detecting code clones in binary executables

  • Authors:
  • Andreas Sæbjørnsen;Jeremiah Willcock;Thomas Panas;Daniel Quinlan;Zhendong Su

  • Affiliations:
  • University of California, Davis, Davis, CA, USA;Indiana University, Bloomington, IN, USA;Lawrence Livermore National Laboratory, Livermore, CA, USA;Lawrence Livermore National Laboratory, Livermore, CA, USA;University of California, Davis, Davis, CA, USA

  • Venue:
  • Proceedings of the eighteenth international symposium on Software testing and analysis
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

Large software projects contain significant code duplication, mainly due to copying and pasting code. Many techniques have been developed to identify duplicated code to enable applications such as refactoring, detecting bugs, and protecting intellectual property. Because source code is often unavailable, especially for third-party software, finding duplicated code in binaries becomes particularly important. However, existing techniques operate primarily on source code, and no effective tool exists for binaries. In this paper, we describe the first practical clone detection algorithm for binary executables. Our algorithm extends an existing tree similarity framework based on clustering of characteristic vectors of labeled trees with novel techniques to normalize assembly instructions and to accurately and compactly model their structural information. We have implemented our technique and evaluated it on Windows XP system binaries totaling over 50 million assembly instructions. Results show that it is both scalable and precise: it analyzed Windows XP system binaries in a few hours and produced few false positives. We believe our technique is a practical, enabling technology for many applications dealing with binary code.