Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey

Authors:
Asaf Shabtai;Robert Moskovitch;Yuval Elovici;Chanan Glezer
Affiliations:
Deutsche Telekom Laboratories at Ben-Gurion University, Ben-Gurion University, Be'er Sheva 84105, Israel (all authors)
Venue:
Information Security Tech. Report
Year:
2009

Abstract

This research synthesizes a taxonomy for classifying methods that detect new malicious code by applying Machine Learning (ML) methods to static features extracted from executables. The taxonomy is then used to classify research on this topic and to pinpoint critical open research issues in light of emerging threats. The article addresses several facets of the detection challenge: file representation and feature selection methods, classification algorithms, weighting ensembles, the imbalance problem, active learning, and chronological evaluation. From the survey we conclude that a framework for detecting new malicious code in executable files can be designed to achieve very high accuracy while maintaining a low false-positive rate (i.e., few benign files misclassified as malicious). The framework should include training multiple classifiers on various types of features (mainly OpCode n-grams, byte n-grams, and Portable Executable features), applying a weighting algorithm to the classification results of the individual classifiers, and an active learning mechanism to maintain high detection accuracy. The training of classifiers should also account for the imbalance problem by generating classifiers that perform accurately in a real-life situation, where the percentage of malicious files among all files is estimated to be approximately 10%.
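
As a rough illustration of two ingredients named in the abstract, the sketch below extracts byte n-grams from a file's raw content and combines the outputs of several classifiers with a weighted vote. It is a minimal sketch under stated assumptions, not the surveyed framework: the function names, the normalization by the most frequent n-gram, the classifier names, the weights, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count overlapping byte n-grams in a file's raw content."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def term_frequencies(counts: Counter) -> dict:
    """Normalize n-gram counts by the most frequent n-gram in the file.

    Assumption: this is one common normalization; the abstract does not
    prescribe a specific one.
    """
    if not counts:
        return {}
    top = max(counts.values())
    return {gram: c / top for gram, c in counts.items()}


def weighted_vote(scores: dict, weights: dict, threshold: float = 0.5) -> bool:
    """Combine per-classifier malicious scores with a weighted average.

    `scores` maps a (hypothetical) classifier name to its estimated
    probability that the file is malicious; `weights` maps the same names
    to non-negative weights. Returns True when the combined score reaches
    the threshold.
    """
    total = sum(weights[name] for name in scores)
    combined = sum(scores[name] * weights[name] for name in scores) / total
    return combined >= threshold


if __name__ == "__main__":
    features = term_frequencies(byte_ngrams(b"abcabc", n=3))
    print(features)
    # Hypothetical outputs of classifiers trained on three feature types.
    scores = {"opcode_ngrams": 0.9, "byte_ngrams": 0.4, "pe_features": 0.7}
    weights = {"opcode_ngrams": 0.5, "byte_ngrams": 0.2, "pe_features": 0.3}
    print(weighted_vote(scores, weights))
```

In practice the weights would be derived from each classifier's measured performance on a validation set whose malicious-to-benign ratio reflects the expected deployment conditions, rather than being fixed by hand as they are here.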