Learning to Detect and Classify Malicious Executables in the Wild

Authors:
J. Zico Kolter; Marcus A. Maloof
Venue:
The Journal of Machine Learning Research
Year:
2006


Abstract

We describe the use of machine learning and data mining to detect and classify malicious executables as they appear in the wild. We gathered 1,971 benign and 1,651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed other methods with an area under the ROC curve of 0.996. Results suggest that our methodology will scale to larger collections of executables. We also evaluated how well the methods classified executables based on the function of their payload, such as opening a backdoor and mass-mailing. Areas under the ROC curve for detecting payload function were in the neighborhood of 0.9, lower than those for the detection task. However, we attribute this drop in performance to fewer training examples and to the challenge of obtaining properly labeled examples, rather than to a failing of the methodology or to some inherent difficulty of the classification task. Finally, we applied detectors to 291 malicious executables discovered after we gathered our original collection, and boosted decision trees achieved a true-positive rate of 0.98 for a desired false-positive rate of 0.05. This result is particularly important, for it suggests that our methodology could be used as the basis for an operational system for detecting previously undiscovered malicious executables.
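The feature pipeline summarized in the abstract (byte n-grams as features, followed by selection of the most relevant ones) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the n-gram length, the number of n-grams retained, or the relevance measure, so `n=4`, `top_k=500`, and information gain are choices made here for the example, and the function names `byte_ngrams`, `information_gain`, and `select_ngrams` are hypothetical.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the distinct byte n-grams in data, each as a hex string."""
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a two-class split with pos and neg members."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(present, labels) -> float:
    """Information gain of a boolean feature with respect to boolean labels.

    Assumption: information gain stands in for the unspecified
    relevance measure mentioned in the abstract.
    """
    n = len(labels)
    gain = entropy(sum(labels), n - sum(labels))
    for value in (True, False):
        subset = [lab for has, lab in zip(present, labels) if has == value]
        if subset:
            pos = sum(subset)
            gain -= len(subset) / n * entropy(pos, len(subset) - pos)
    return gain


def select_ngrams(samples, labels, n: int = 4, top_k: int = 500) -> list:
    """Rank every distinct n-gram by information gain and keep the top_k.

    samples: list of bytes objects (raw executable contents)
    labels:  list of bools, True for malicious (hypothetical encoding)
    """
    grams_per_sample = [byte_ngrams(s, n) for s in samples]
    vocabulary = set().union(*grams_per_sample)
    scored = []
    for gram in vocabulary:
        present = [gram in grams for grams in grams_per_sample]
        scored.append((information_gain(present, labels), gram))
    # Highest gain first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gram for _, gram in scored[:top_k]]


# Toy data standing in for real executables (made up for illustration).
toy_samples = [
    b"\x90\x90\xeb\xfe\x00",
    b"\x90\x90\xeb\xfe\x01",
    b"\x00\x01\x02\x03\x04",
    b"\x00\x01\x02\x03\x05",
]
toy_labels = [True, True, False, False]
print(select_ngrams(toy_samples, toy_labels, n=4, top_k=2))
```

Each executable would then be encoded as a boolean vector recording which of the selected n-grams it contains, and that vector would be passed to a classifier such as the boosted decision trees named in the abstract. With hundreds of millions of distinct n-grams, holding the full vocabulary in memory as done here would likely be impractical, so a real system would need a streaming or disk-backed count.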