Kernel Methods for Pattern Analysis
Kernel Methods for Pattern Analysis
Deobfuscation: Reverse Engineering Obfuscated Code
WCRE '05 Proceedings of the 12th Working Conference on Reverse Engineering
How slow is the k-means method?
Proceedings of the twenty-second annual symposium on Computational geometry
Learning to Detect and Classify Malicious Executables in the Wild
The Journal of Machine Learning Research
Static analysis of executables to detect malicious patterns
SSYM'03 Proceedings of the 12th conference on USENIX Security Symposium - Volume 12
Static disassembly of obfuscated binaries
SSYM'04 Proceedings of the 13th conference on USENIX Security Symposium - Volume 13
Exploring Multiple Execution Paths for Malware Analysis
SP '07 Proceedings of the 2007 IEEE Symposium on Security and Privacy
Deobfuscator: An Automated Approach to the Identification and Removal of Code Obfuscation
WCRE '07 Proceedings of the 14th Working Conference on Reverse Engineering
Introduction to Information Retrieval
Introduction to Information Retrieval
Learning and Classification of Malware Behavior
DIMVA '08 Proceedings of the 5th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment
A Study of the Packer Problem and Its Solutions
RAID '08 Proceedings of the 11th international symposium on Recent Advances in Intrusion Detection
Feature hashing for large scale multitask learning
ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
peHash: a novel approach to fast malware clustering
LEET'09 Proceedings of the 2nd USENIX conference on Large-scale exploits and emergent threats: botnets, spyware, worms, and more
BitShred: feature hashing malware for scalable triage and semantic analysis
Proceedings of the 18th ACM conference on Computer and communications security
Hi-index | 0.00 |
The current lack of automatic and speedy labeling of a large number (thousands) of malware samples seen everyday delays the generation of malware signatures and has become a major challenge for anti-virus industries. In this paper, we design, implement and evaluate a novel, scalable framework, called MutantX-S, that can efficiently cluster a large number of samples into families based on programs' static features, i.e., code instruction sequences. MutantX-S is a unique combination of several novel techniques to address the practical challenges of malware clustering. Specifically, it exploits the instruction format of ×86 architecture and represents a program as a sequence of opcodes, facilitating the extraction of N-gram features. It also exploits the hashing trick recently developed in the machine learning community to reduce the dimensionality of extracted feature vectors, thus significantly lowering the memory requirement and computation costs. Our comprehensive evaluation on a MutantX-S prototype using a database of more than 130,000 malware samples has shown its ability to correctly cluster over 80% of samples within 2 hours, achieving a good balance between accuracy and scalability. Applying MutantX-S on malware samples created at different times, we also demonstrate that MutantX-S achieves high accuracy in predicting labels for previously unknown malware.