Malware clustering is commonly applied by malware analysts to cope with the ever-growing number of distinct malware variants collected every day from the Internet. While malware clustering systems can be useful for a variety of applications, assessing the quality of their results is intrinsically hard. In fact, clustering can be viewed as an unsupervised learning process over a dataset for which the complete ground truth is usually not available. Previous studies propose to evaluate malware clustering results by leveraging the labels assigned to the malware samples by multiple anti-virus scanners (AVs). However, the methods proposed thus far require (semi-)manual adjustment and mapping between the labels generated by different AVs, and are limited to selecting a reference subset of samples for which an agreement on their labels can be reached across a majority of AVs. This approach may bias the reference set towards "easy to cluster" malware samples, thus potentially resulting in an overoptimistic estimate of the accuracy of the malware clustering results. In this paper we propose VAMO, a system that provides a fully automated quantitative analysis of the validity of malware clustering results. Unlike previous work, VAMO does not seek a majority voting-based consensus across different AV labels, and does not discard the malware samples for which such a consensus cannot be reached. Rather, VAMO explicitly deals with the inconsistencies typical of multiple AV labels to build a reference set that is more representative than those built by majority voting-based approaches. Furthermore, VAMO avoids the need for the (semi-)manual mapping between AV labels from different scanners that was required in previous work.
Through an extensive evaluation in a controlled setting and a real-world application, we show that VAMO outperforms majority voting-based approaches, and provides a better way for malware analysts to automatically assess the quality of their malware clustering results.
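To make the bias of the majority voting baseline concrete, the following is a minimal sketch of how such a reference set might be selected. It is not VAMO's algorithm, which the abstract does not detail. The function name `majority_vote_reference`, the `threshold` parameter, the sample identifiers, and the label data are all hypothetical, and the labels are assumed to be already reduced to bare family names.

```python
from collections import Counter


def majority_vote_reference(av_labels, threshold=0.5):
    """Keep only samples whose top label is shared by more than
    `threshold` of the AV scanners (hypothetical baseline)."""
    reference = {}
    for sample, labels in av_labels.items():
        if not labels:
            continue  # no AV produced a label for this sample
        family, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) > threshold:
            reference[sample] = family
    return reference


# Hypothetical family labels from three AV scanners (illustrative only).
av_labels = {
    "s1": ["zbot", "zbot", "zbot"],             # full agreement
    "s2": ["zbot", "zbot", "spyeye"],           # majority agreement
    "s3": ["zbot", "spyeye", "generic"],        # no consensus
    "s4": ["conficker", "kido", "downadup"],    # aliases of one family
}

reference = majority_vote_reference(av_labels)
discarded = sorted(set(av_labels) - set(reference))

print(reference)   # samples kept in the reference set
print(discarded)   # samples dropped for lack of consensus
```

In this toy input, only the two samples on which the scanners agree survive. Sample `s4` is dropped even though its three labels are aliases commonly used for the same family, which is the kind of inconsistency that previously required a (semi-)manual label mapping.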