Grammar and model extraction for security applications using dynamic program binary analysis

  • Authors:
  • Dawn Song; Juan Caballero Bayerri

  • Affiliations:
  • Carnegie Mellon University; Carnegie Mellon University

  • Venue:
  • Grammar and model extraction for security applications using dynamic program binary analysis
  • Year:
  • 2010

Abstract

In this thesis we develop techniques for analyzing security-relevant functionality in a program that do not require access to the program's source code, only to its binary form. Such techniques are needed to analyze closed-source programs such as commercial-off-the-shelf applications and malware, which are prevalent in computer systems. Our techniques are dynamic: they extract information from executions of the program. Dynamic techniques are precise because they can examine the exact run-time behavior of the program, without the approximations that static analysis requires. In particular, we develop dynamic program binary analysis techniques to address three problems: protocol reverse-engineering, binary code reuse, and model extraction. We demonstrate our techniques on a variety of security applications including active botnet infiltration, deviation detection, attack generation, vulnerability-based signature generation, and vulnerability discovery.

Protocol reverse-engineering techniques infer the grammar of undocumented program inputs, such as network protocols and file formats. Such grammars are important for applications like network monitoring, signature generation, or botnet infiltration. When no specification is available, rich information about the protocol or file format can be reversed from a program that implements it. We develop a new approach to protocol reverse-engineering based on dynamic program binary analysis. Our approach reverses the format and semantics of protocol messages by monitoring how an implementation of the protocol processes them. To demonstrate our techniques, we extract the grammar of the previously undocumented C&C protocol used by MegaD, a prevalent spam botnet.

Binary code reuse techniques make a code fragment from a program binary reusable by external source code. We propose a novel approach to automatic binary code reuse that identifies the interface of a binary code fragment and extracts its instructions and data dependencies. The extracted code is self-contained and independent of the rest of the functionality in the program. To demonstrate our techniques, we use them to extract proprietary cryptographic routines used by malware and show how those routines enable infiltrating botnets that use encrypted protocols.

Model extraction techniques build a model of the functionality of a code fragment. Closed-source programs often contain undocumented, yet security-relevant, functionality such as filters or proprietary algorithms. To reason about the security properties of such functionality we develop model extraction techniques that work directly on program binaries. To produce models with high coverage, we extend previous dynamic symbolic execution techniques to programs that use string operations, programs that parse highly structured inputs, and programs that use complex functions like encryption or checksums. We demonstrate the utility of our techniques to discover vulnerabilities in malware and use the extracted models to automatically find subtle content-sniffing XSS attacks on Web applications, to identify deviations between different implementations of the same functionality, and to generate signatures for vulnerabilities in software.