Large-scale machine learning-based malware detection: confronting the "10-fold cross validation" scheme with reality

Authors:
Kevin Allix, Tegawendé F. Bissyandé, Quentin Jérome, Jacques Klein, Radu State, Yves Le Traon
Affiliations:
University of Luxembourg, Luxembourg (all authors)
Venue:
Proceedings of the 4th ACM conference on Data and application security and privacy
Year:
2014

Abstract

To address the issue of malware detection, researchers have recently started to investigate the capabilities of machine-learning techniques for proposing effective approaches. Several promising results have been reported in the literature, with many approaches assessed using the common "10-Fold cross validation" scheme. This paper revisits the purpose of malware detection to discuss whether the "10-Fold" scheme is adequate for validating techniques that may not perform well in reality. To this end, we have devised several machine learning classifiers that rely on a novel set of features built from applications' control-flow graphs (CFGs). We use a sizeable dataset of over 50,000 Android applications collected from the sources from which state-of-the-art approaches have selected their data. We show that our approach outperforms existing machine learning-based approaches. However, this high performance on usual-size datasets does not translate into high performance in the wild.
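
For readers unfamiliar with the "10-Fold cross validation" scheme named in the abstract, the following is a minimal, generic sketch of how such a scheme partitions a dataset. It is not the authors' implementation: the function names `k_fold_indices` and `cross_validate`, and the `fit` and `score` callables, are hypothetical and introduced only for illustration. The sketch shows that every sample is used for testing exactly once, and that each test fold is drawn from the same pool as its training data.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle so that samples are assigned to folds at random.
    random.Random(seed).shuffle(indices)
    # Slice the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training set is the union of the other k - 1 folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(
    samples: Sequence,
    labels: Sequence,
    fit: Callable,    # hypothetical: fit(train_samples, train_labels) -> model
    score: Callable,  # hypothetical: score(model, test_samples, test_labels) -> float
    k: int = 10,
) -> float:
    """Return the mean score over k folds."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        scores.append(
            score(model, [samples[i] for i in test], [labels[i] for i in test])
        )
    return sum(scores) / len(scores)
```

Any classifier can be plugged in through `fit` and `score`. A separate evaluation on data collected independently of the training pool would be needed to estimate performance "in the wild"; that step is not part of this sketch.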