A scalable multi-level feature extraction technique to detect malicious executables

Authors:
Mohammad M. Masud; Latifur Khan; Bhavani Thuraisingham
Affiliations:
Department of Computer Science, The University of Texas at Dallas, Richardson, USA 75080; Department of Computer Science, The University of Texas at Dallas, Richardson, USA 75083-0688; Department of Computer Science, The University of Texas at Dallas, Richardson, USA 75083-0688
Venue:
Information Systems Frontiers
Year:
2008

Abstract

We present a scalable, multi-level feature extraction technique for detecting malicious executables. We propose a novel combination of three kinds of features at different levels of abstraction: binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique and apply it to a large corpus of real benign and malicious executables. These features are extracted from the corpus and used to train a classifier, which achieves high accuracy and a low false positive rate in detecting malicious executables. Our approach is knowledge-based for two reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features and to build a classification model. We compare our model against other feature-based approaches for malicious code detection and find that it performs better in terms of detection accuracy and false alarm rate.
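
As a rough illustration of the lowest feature level described in the abstract, the sketch below extracts binary n-grams from raw bytes and ranks them with a selection statistic. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not name the statistic used for feature selection, so information gain is assumed here, and the function names (`byte_ngrams`, `information_gain`, `select_features`) and the toy byte strings are hypothetical. It also holds every n-gram in memory, so it does not reproduce the scalability of the proposed technique.

```python
import math


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of distinct n-byte sequences that occur in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a set with pos positive and neg negative items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(present, labels) -> float:
    """Information gain of a binary feature.

    present[i] is True if the feature occurs in sample i;
    labels[i] is 1 for a malicious sample and 0 for a benign one.
    Assumption: information gain stands in for the unspecified
    selection statistic mentioned in the abstract.
    """
    total = len(labels)
    pos = sum(labels)
    gain = entropy(pos, total - pos)
    for value in (True, False):
        subset = [lab for p, lab in zip(present, labels) if p == value]
        if subset:
            sub_pos = sum(subset)
            gain -= len(subset) / total * entropy(sub_pos, len(subset) - sub_pos)
    return gain


def select_features(samples, labels, n: int = 2, k: int = 3) -> list:
    """Return the k n-grams with the highest information gain (ties broken by byte order)."""
    grams = [byte_ngrams(s, n) for s in samples]
    vocab = set().union(*grams)
    scored = [(information_gain([g in gs for gs in grams], labels), g) for g in vocab]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in scored[:k]]


# Toy data (hypothetical): two "malicious" and two "benign" byte strings.
samples = [b"\x90\x90\xcc", b"\x90\x90\xeb", b"\x00\x01\x02", b"\x00\x01\x03"]
labels = [1, 1, 0, 0]
top = select_features(samples, labels, n=2, k=2)
```

In this toy run, the two selected n-grams are the ones that occur in every sample of exactly one class. The selected n-grams would then serve as binary presence features for a classifier; the assembly-instruction and DLL-call levels require a disassembler and a header parser and are not sketched here.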