VAMO: towards a fully automated malware clustering validity analysis

Authors:
Roberto Perdisci;ManChon U
Affiliations:
University of Georgia, Athens, GA;University of Georgia, Athens, GA
Venue:
Proceedings of the 28th Annual Computer Security Applications Conference
Year:
2012

Abstract

Malware clustering is commonly applied by malware analysts to cope with the ever-growing number of distinct malware variants collected every day from the Internet. While malware clustering systems can be useful for a variety of applications, assessing the quality of their results is intrinsically hard. In fact, clustering can be viewed as an unsupervised learning process over a dataset for which the complete ground truth is usually not available. Previous studies propose to evaluate malware clustering results by leveraging the labels assigned to the malware samples by multiple anti-virus scanners (AVs). However, the methods proposed thus far require a (semi-)manual adjustment and mapping between labels generated by different AVs, and are limited to selecting a reference subset of samples for which an agreement regarding their labels can be reached across a majority of AVs. This approach may bias the reference set towards "easy to cluster" malware samples, thus potentially resulting in an overoptimistic estimate of the accuracy of the malware clustering results. In this paper we propose VAMO, a system that provides a fully automated quantitative analysis of the validity of malware clustering results. Unlike previous work, VAMO does not seek a majority voting-based consensus across different AV labels, and does not discard the malware samples for which such a consensus cannot be reached. Rather, VAMO explicitly deals with the inconsistencies typical of multiple AV labels to build a reference set that is more representative than those produced by majority voting-based approaches. Furthermore, VAMO avoids the need for the (semi-)manual mapping between AV labels from different scanners that was required in previous work. Through an extensive evaluation in a controlled setting and a real-world application, we show that VAMO outperforms majority voting-based approaches, and provides a better way for malware analysts to automatically assess the quality of their malware clustering results.
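
To make the criticized baseline concrete, the following is a minimal sketch of majority voting-based reference set selection. It is not VAMO and not any prior system's actual implementation. The function name `majority_vote_reference`, the `min_fraction` threshold, and the toy family names are all hypothetical. The sketch also assumes the AV labels have already been normalized to a shared family vocabulary, which is the (semi-)manual mapping step the abstract describes.

```python
from collections import Counter


def majority_vote_reference(av_labels, min_fraction=0.5):
    """Build a reference set by majority vote (hypothetical baseline sketch).

    av_labels maps a sample id to a list of family labels, one per AV scanner.
    None means the scanner did not detect the sample. Labels are assumed to be
    already mapped to a common family vocabulary.
    Returns (reference, discarded): samples with a majority label, and the
    samples dropped because no label reached a majority.
    """
    reference = {}
    discarded = []
    for sample, labels in av_labels.items():
        votes = Counter(label for label in labels if label is not None)
        if not votes:
            discarded.append(sample)
            continue
        family, count = votes.most_common(1)[0]
        # Keep the sample only if strictly more than min_fraction of ALL
        # scanners (including non-detections) agree on one family.
        if count > min_fraction * len(labels):
            reference[sample] = family
        else:
            discarded.append(sample)
    return reference, discarded


# Toy data: three scanners, four samples (family names are illustrative only).
toy_labels = {
    "s1": ["zbot", "zbot", "zbot"],      # unanimous
    "s2": ["zbot", "zbot", "spyeye"],    # majority agrees
    "s3": ["zbot", "spyeye", "virut"],   # no agreement
    "s4": ["virut", None, "sality"],     # no agreement, one non-detection
}

reference, discarded = majority_vote_reference(toy_labels)
print("reference:", reference)
print("discarded:", discarded)
```

In this toy run, the two samples on which the scanners disagree are dropped from the reference set. These are plausibly the harder cases, so a clustering evaluated only on the remaining samples may look more accurate than it is, which is the bias the abstract points out.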