VILO: a rapid learning nearest-neighbor classifier for malware triage

Authors:
Arun Lakhotia; Andrew Walenstein; Craig Miles; Anshuman Singh
Affiliations:
Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, USA; School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, USA; Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, USA; Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, USA
Venue:
Journal in Computer Virology
Year:
2013

Abstract

VILO is a lazy learner system designed for malware classification and triage. It implements a nearest neighbor (NN) algorithm with similarities computed over Term Frequency × Inverse Document Frequency (TFIDF) weighted opcode mnemonic permutation features (N-perms). Being an NN-classifier, VILO makes minimal structural assumptions about class boundaries, and is thus well suited to the constantly changing malware population. This paper presents an extensive study of the application of VILO to malware analysis. Our experiments demonstrate that (a) VILO is a rapid learner of malware families, i.e., VILO's learning curve stabilizes at high accuracies quickly (training on fewer than 20 variants per family is sufficient); (b) similarity scores derived from TFIDF weighted features should primarily be treated as ordinal measurements; and (c) VILO with N-perm feature vectors outperforms traditional N-gram feature vectors when used to classify real-world malware into their respective families.
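
The pipeline the abstract describes (N-perm features, TFIDF weighting, nearest-neighbor lookup) can be illustrated with a minimal Python sketch. This is not VILO's implementation. It assumes that an N-perm is a window of N opcode mnemonics with order ignored, that similarity is cosine similarity, and that IDF uses a common smoothed formula; the abstract specifies none of these details. The names `n_perms` and `NearestNeighborSketch` and the opcode sequences are hypothetical.

```python
import math
from collections import Counter


def n_perms(opcodes, n=2):
    """Assumed N-perm definition: each window of n mnemonics, order ignored."""
    return [tuple(sorted(opcodes[i:i + n])) for i in range(len(opcodes) - n + 1)]


class NearestNeighborSketch:
    """Hypothetical lazy learner: stores weighted vectors, compares at query time."""

    def __init__(self, n=2):
        self.n = n
        self.idf = {}
        self.vectors = []
        self.labels = []

    def fit(self, samples, labels):
        counts = [Counter(n_perms(s, self.n)) for s in samples]
        # Document frequency: number of samples containing each feature.
        df = Counter(f for c in counts for f in c)
        total = len(samples)
        # Smoothed IDF (an assumption; the abstract does not give the formula).
        self.idf = {f: math.log((1 + total) / (1 + k)) + 1.0 for f, k in df.items()}
        self.vectors = [self._weight(c) for c in counts]
        self.labels = list(labels)
        return self

    def _weight(self, counts):
        # TF x IDF weights, L2-normalized so a dot product equals cosine similarity.
        vec = {f: tf * self.idf[f] for f, tf in counts.items() if f in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {f: w / norm for f, w in vec.items()} if norm else {}

    def rank(self, sample):
        """Return (similarity, label) pairs, most similar first."""
        q = self._weight(Counter(n_perms(sample, self.n)))
        scored = [
            (sum(w * v.get(f, 0.0) for f, w in q.items()), label)
            for v, label in zip(self.vectors, self.labels)
        ]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)

    def predict(self, sample):
        return self.rank(sample)[0][1]


# Made-up opcode sequences, for illustration only.
training = [
    (["push", "mov", "call", "add", "pop", "ret"], "family_a"),
    (["mov", "push", "call", "add", "pop", "ret"], "family_a"),
    (["xor", "jmp", "cmp", "jne", "inc", "loop"], "family_b"),
]
clf = NearestNeighborSketch(n=2).fit(
    [ops for ops, _ in training], [label for _, label in training]
)
query = ["mov", "push", "call", "pop", "add", "ret"]
print(clf.predict(query))
```

The sketch uses only the ordering returned by `rank`, never the absolute score values, which is in keeping with finding (b) that the similarity scores are best treated as ordinal.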