Data compression using dynamic Markov modelling
The Computer Journal
Compression and Coding Algorithms
Compression and Coding Algorithms
A compression-based algorithm for Chinese word segmentation
Computational Linguistics
Spam Filtering Using Statistical Data Compression Models
The Journal of Machine Learning Research
Detours: binary interception of Win32 functions
WINSYM'99 Proceedings of the 3rd conference on USENIX Windows NT Symposium - Volume 3
Jakstab: A Static Analysis Platform for Binaries
CAV '08 Proceedings of the 20th international conference on Computer Aided Verification
The IDA Pro Book: The Unofficial Guide to the World's Most Popular Disassembler
The IDA Pro Book: The Unofficial Guide to the World's Most Popular Disassembler
An Abstract Interpretation-Based Framework for Control Flow Reconstruction from Binaries
VMCAI '09 Proceedings of the 10th International Conference on Verification, Model Checking, and Abstract Interpretation
CodeSurfer/x86—A platform for analyzing x86 executables
CC'05 Proceedings of the 14th international conference on Compiler Construction
Proceedings of the Seventh Annual Workshop on Cyber Security and Information Intelligence Research
Binary stirring: self-randomizing instruction addresses of legacy x86 binary code
Proceedings of the 2012 ACM conference on Computer and communications security
Securing untrusted code via compiler-agnostic binary rewriting
Proceedings of the 28th Annual Computer Security Applications Conference
Locating executable fragments with Concordia, a scalable, semantics-based architecture
Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop
Hi-index | 0.00 |
Robust, static disassembly is an important part of achieving high coverage for many binary code analyses, such as reverse engineering, malware analysis, reference monitor in-lining, and software fault isolation. However, one of the major difficulties current disassemblers face is differentiating code from data when they are interleaved. This paper presents a machine learning-based disassembly algorithm that segments an x86 binary into subsequences of bytes and then classifies each subsequence as code or data. The algorithm builds a language model from a set of pre-tagged binaries using a statistical data compression technique. It sequentially scans a new binary executable and sets a breaking point at each potential code-to-code and code-to-data/data-to-code transition. The classification of each segment as code or data is based on the minimum cross-entropy. Experimental results are presented to demonstrate the effectiveness of the algorithm.