On challenges in evaluating malware clustering

Authors:
Peng Li; Limin Liu; Debin Gao; Michael K. Reiter
Affiliations:
Department of Computer Science, University of North Carolina, Chapel Hill, NC; State Key Lab of Information Security, Graduate School of Chinese Academy of Sciences; School of Information Systems, Singapore Management University, Singapore; Department of Computer Science, University of North Carolina, Chapel Hill, NC
Venue:
RAID'10 Proceedings of the 13th international conference on Recent advances in intrusion detection
Year:
2010


Abstract

Malware clustering and classification are important tools that enable analysts to prioritize their malware analysis efforts. The recent emergence of fully automated methods for malware clustering and classification that report high accuracy suggests that this problem may largely be solved. In this paper, we report the results of our attempt to confirm our conjecture that the method of selecting ground-truth data in prior evaluations biases their results toward high accuracy. To examine this conjecture, we apply clustering algorithms from a different domain (plagiarism detection), first to the dataset used in a prior work's evaluation and then to a wholly new malware dataset, to see if clustering algorithms developed without attention to subtleties of malware obfuscation are nevertheless successful. While these studies provide conflicting signals as to the correctness of our conjecture, our investigation of possible reasons uncovers, we believe, a cautionary note regarding the significance of highly accurate clustering results: such results can be affected by testing on a dataset with a biased cluster-size distribution.
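
The effect of a biased cluster-size distribution on reported accuracy can be illustrated with a small sketch. The abstract does not say which accuracy measure is used, so the code below assumes one common pair of clustering precision and recall measures (best-overlap matching between predicted and reference clusters). The function name `precision_recall` and both reference datasets are hypothetical and were made up for illustration; they are not data from the paper. The sketch scores a trivial clustering that places every sample in a single cluster.

```python
from collections import Counter


def precision_recall(predicted, reference):
    """Score a clustering against a reference clustering.

    Assumed metric (not taken from the abstract):
      precision = (1/n) * sum over predicted clusters of their best overlap
                  with any reference cluster
      recall    = (1/n) * sum over reference clusters of their best overlap
                  with any predicted cluster
    Both arguments are lists of cluster labels, one label per sample.
    """
    n = len(reference)
    # Count how many samples fall in each (predicted, reference) label pair.
    overlap = Counter(zip(predicted, reference))
    best_for_pred = {}
    best_for_ref = {}
    for (p, r), count in overlap.items():
        best_for_pred[p] = max(best_for_pred.get(p, 0), count)
        best_for_ref[r] = max(best_for_ref.get(r, 0), count)
    precision = sum(best_for_pred.values()) / n
    recall = sum(best_for_ref.values()) / n
    return precision, recall


# Hypothetical reference clusterings of 100 samples each.
skewed_reference = ["A"] * 90 + ["B"] * 5 + ["C"] * 5       # one dominant family
balanced_reference = ["A"] * 34 + ["B"] * 33 + ["C"] * 33   # similar-sized families

# A trivial "algorithm" that puts every sample into one cluster.
trivial = ["all"] * 100

skewed_scores = precision_recall(trivial, skewed_reference)
balanced_scores = precision_recall(trivial, balanced_reference)
print(skewed_scores)    # (0.9, 1.0)
print(balanced_scores)  # (0.34, 1.0)
```

Under these assumptions, the same uninformative clustering reaches a precision of 0.9 on the skewed reference but only 0.34 on the balanced one, which is one way a high score can say more about the dataset than about the algorithm.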