MutantX-S: scalable malware clustering based on static features

Authors:
Xin Hu;Sandeep Bhatkar;Kent Griffin;Kang G. Shin
Affiliations:
IBM T.J. Watson Research Center;Symantec Research Labs;Symantec Research Labs;University of Michigan, Ann Arbor
Venue:
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Year:
2013

Abstract

The lack of fast, automatic labeling for the thousands of malware samples seen every day delays the generation of malware signatures and has become a major challenge for the anti-virus industry. In this paper, we design, implement, and evaluate a novel, scalable framework, called MutantX-S, that efficiently clusters a large number of samples into families based on programs' static features, i.e., code instruction sequences. MutantX-S is a unique combination of several novel techniques that address the practical challenges of malware clustering. Specifically, it exploits the instruction format of the x86 architecture to represent a program as a sequence of opcodes, which facilitates the extraction of N-gram features. It also uses the hashing trick recently developed in the machine learning community to reduce the dimensionality of the extracted feature vectors, significantly lowering memory requirements and computation costs. Our comprehensive evaluation of a MutantX-S prototype on a database of more than 130,000 malware samples shows that it correctly clusters over 80% of samples within 2 hours, achieving a good balance between accuracy and scalability. Applying MutantX-S to malware samples created at different times, we also demonstrate that it achieves high accuracy in predicting labels for previously unknown malware.
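
The two feature steps named in the abstract, opcode N-grams and the hashing trick, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the opcode list, the choice of N = 2, the 16-slot vector, the MD5-based hash, and the function names `opcode_ngrams` and `hash_features` are all assumptions made for illustration, and the sketch uses the simple unsigned variant of feature hashing.

```python
import hashlib
from collections import Counter


def opcode_ngrams(opcodes, n=2):
    """Count every run of n consecutive opcodes (hypothetical helper)."""
    return Counter(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )


def hash_features(ngram_counts, dim=16):
    """Fold n-gram counts into a fixed-length vector (unsigned hashing trick).

    Distinct n-grams may collide in the same slot; that is the price paid
    for a vector whose length does not grow with the number of n-grams.
    """
    vec = [0] * dim
    for ngram, count in ngram_counts.items():
        digest = hashlib.md5(" ".join(ngram).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % dim
        vec[index] += count
    return vec


# Made-up opcode sequence standing in for a disassembled program.
opcodes = ["push", "mov", "sub", "mov", "call", "mov", "sub", "ret"]
counts = opcode_ngrams(opcodes, n=2)
vector = hash_features(counts, dim=16)
```

Here `vector` always has 16 entries regardless of how many distinct 2-grams occur, which is what keeps memory bounded when the N-gram space is large. A stable hash such as MD5 is used instead of Python's built-in `hash` so that the same N-gram maps to the same slot across runs.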