Differentiating code from data in x86 binaries

Authors:
Richard Wartell;Yan Zhou;Kevin W. Hamlen;Murat Kantarcioglu;Bhavani Thuraisingham
Affiliations:
Computer Science Department, University of Texas at Dallas, Richardson, TX;Computer Science Department, University of Texas at Dallas, Richardson, TX;Computer Science Department, University of Texas at Dallas, Richardson, TX;Computer Science Department, University of Texas at Dallas, Richardson, TX;Computer Science Department, University of Texas at Dallas, Richardson, TX
Venue:
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part III
Year:
2011

Abstract

Robust, static disassembly is an important part of achieving high coverage for many binary code analyses, such as reverse engineering, malware analysis, reference monitor in-lining, and software fault isolation. However, one of the major difficulties current disassemblers face is differentiating code from data when they are interleaved. This paper presents a machine learning-based disassembly algorithm that segments an x86 binary into subsequences of bytes and then classifies each subsequence as code or data. The algorithm builds a language model from a set of pre-tagged binaries using a statistical data compression technique. It sequentially scans a new binary executable and sets a breaking point at each potential code-to-code and code-to-data/data-to-code transition. The classification of each segment as code or data is based on the minimum cross-entropy. Experimental results are presented to demonstrate the effectiveness of the algorithm.
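
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the compression technique, so a simple order-2 byte context model with add-one smoothing is swapped in as a stand-in. The class `ByteContextModel`, the function `classify`, and the tiny training samples are all hypothetical, and the segmentation step (placing break points at candidate transitions) is omitted. One model is trained on bytes tagged as code and one on bytes tagged as data; a segment receives the label of the model under which its cross-entropy (average bits per byte) is lowest.

```python
import math


class ByteContextModel:
    """Order-k byte model with add-one smoothing (stand-in for a compression model)."""

    def __init__(self, order=2):
        self.order = order
        self.counts = {}  # (context bytes, next byte) -> occurrences
        self.totals = {}  # context bytes -> occurrences

    def train(self, data: bytes) -> None:
        for i in range(len(data)):
            ctx = data[max(0, i - self.order):i]
            key = (ctx, data[i])
            self.counts[key] = self.counts.get(key, 0) + 1
            self.totals[ctx] = self.totals.get(ctx, 0) + 1

    def cross_entropy(self, segment: bytes) -> float:
        """Average number of bits per byte needed to encode `segment`."""
        if not segment:
            return 0.0
        bits = 0.0
        for i in range(len(segment)):
            ctx = segment[max(0, i - self.order):i]
            seen = self.counts.get((ctx, segment[i]), 0)
            total = self.totals.get(ctx, 0)
            # Add-one smoothing over the 256 possible byte values.
            bits -= math.log2((seen + 1) / (total + 256))
        return bits / len(segment)


def classify(segment: bytes, models: dict) -> str:
    """Return the label whose model gives the minimum cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(segment))


# Toy "pre-tagged" training bytes (assumed for illustration only).
code_sample = bytes.fromhex("558bec83ec108b45088b4d0c03c1c9c3") * 20
data_sample = b"Error: file not found\x00Hello, world\x00" * 20

code_model = ByteContextModel()
code_model.train(code_sample)
data_model = ByteContextModel()
data_model.train(data_sample)
models = {"code": code_model, "data": data_model}

print(classify(bytes.fromhex("558bec83ec10c9c3"), models))
print(classify(b"Hello, world\x00", models))
```

A realistic model would be trained on a large corpus of tagged binaries and would use a stronger predictor than fixed-order counts; the sketch only shows how a minimum cross-entropy decision selects between the two labels.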