Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list

Authors:
Yanfang Ye; Tao Li; Kai Huang; Qingshan Jiang; Yong Chen
Affiliations:
Department of Computer Science, Xiamen University, Xiamen, People's Republic of China 361005; School of Computer Science, Florida International University, Miami, USA 33199; Software School, Xiamen University, Xiamen, People's Republic of China 361005; Software School, Xiamen University, Xiamen, People's Republic of China 361005; Anti-virus Laboratory, Kingsoft Corporation, Zhuhai, People's Republic of China 519000
Venue:
Journal of Intelligent Information Systems
Year:
2010

Abstract

Attacks by malware (e.g., viruses, backdoors, spyware, trojans, and worms) pose a major security threat to computer users. Currently, the most significant line of defense against malware is anti-virus products, which authenticate valid software against a whitelist, block invalid software using a blacklist, and run any unknown software (i.e., the gray list) in a controlled manner. The gray list, which contains unknown programs that could be either normal or malicious, is usually authenticated or rejected manually by virus analysts. Unfortunately, as malware writing techniques develop, the number of gray-list file samples that virus analysts must analyze each day keeps increasing. The gray list is not only large but also has an imbalanced class distribution in which malware is the minority class. In this paper, we describe our research effort on building automatic, effective, and interpretable classifiers, based on the analysis of Application Programming Interfaces (APIs) called by Windows Portable Executable (PE) files, for detecting malware in the large and imbalanced gray list. Our effort is based on associative classifiers because of their high interpretability and their capability of discovering interesting relationships among API calls. We first adapt several post-processing techniques of associative classification, including rule pruning and rule re-ordering, for building effective associative classifiers from large collections of training data. To help virus analysts detect malware in the imbalanced gray list, we then develop the Hierarchical Associative Classifier (HAC). 
HAC constructs a two-level associative classifier to maximize the precision and recall of the minority (malware) class: in the first level, it uses high-precision rules of the majority (benign) class and low-precision rules of the minority class to achieve high recall; in the second level, it ranks the minority-class files and optimizes precision. Finally, since our case studies are based on a large, real data collection obtained from the Anti-virus Lab of Kingsoft Corporation, including 8,000,000 malware samples, 8,000,000 benign files, and 100,000 file samples from the gray list, we empirically examine the sampling strategy used to build the classifiers on such a large collection, in order to avoid over-fitting and achieve both effectiveness and efficiency. Promising experimental results demonstrate the effectiveness and efficiency of the HAC classifier. HAC has already been incorporated into the scanning tool of Kingsoft's Anti-Virus software.
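
The two-level idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Rule` structure, the confidence thresholds, the choice to keep unmatched samples as candidates, and the use of the highest matching rule confidence as the ranking score are all assumptions made for illustration, and the example rules and confidence values are invented. Level one clears a file as benign only when a high-confidence benign rule fires and no malware rule fires, which favors recall; level two ranks the remaining candidates so that a cut-off can favor precision.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    apis: FrozenSet[str]  # antecedent: API calls that must all be present
    label: str            # "malware" or "benign"
    confidence: float     # rule precision estimated on training data (illustrative)


def fired(rules: List[Rule], sample_apis: FrozenSet[str],
          label: str, min_conf: float) -> List[Rule]:
    """Rules of one class, above a confidence floor, whose antecedent the sample contains."""
    return [r for r in rules
            if r.label == label and r.confidence >= min_conf and r.apis <= sample_apis]


def level_one(samples: Dict[str, FrozenSet[str]], rules: List[Rule],
              benign_min_conf: float = 0.95,
              malware_min_conf: float = 0.5) -> List[str]:
    """High-recall filter (assumed policy): clear a sample as benign only if a
    high-confidence benign rule fires and no malware rule fires."""
    candidates = []
    for name, apis in samples.items():
        strong_benign = fired(rules, apis, "benign", benign_min_conf)
        any_malware = fired(rules, apis, "malware", malware_min_conf)
        if strong_benign and not any_malware:
            continue  # confidently benign, dropped from further analysis
        candidates.append(name)
    return candidates


def level_two(candidates: List[str], samples: Dict[str, FrozenSet[str]],
              rules: List[Rule]) -> List[Tuple[str, float]]:
    """Rank candidates by the highest confidence among matching malware rules
    (assumed score); candidates with no matching malware rule score 0.0."""
    scored = [
        (name, max((r.confidence
                    for r in fired(rules, samples[name], "malware", 0.0)),
                   default=0.0))
        for name in candidates
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical rules and samples; the API names are real Windows APIs,
# but the rules and their confidences are made up for this sketch.
rules = [
    Rule(frozenset({"CreateRemoteThread", "WriteProcessMemory"}), "malware", 0.9),
    Rule(frozenset({"RegSetValueExA"}), "malware", 0.55),
    Rule(frozenset({"GetOpenFileNameA", "PrintDlgA"}), "benign", 0.98),
]
samples = {
    "a.exe": frozenset({"CreateRemoteThread", "WriteProcessMemory", "OpenProcess"}),
    "b.exe": frozenset({"GetOpenFileNameA", "PrintDlgA", "CreateFileA"}),
    "c.exe": frozenset({"RegSetValueExA", "GetOpenFileNameA", "PrintDlgA"}),
    "d.exe": frozenset({"CreateFileA"}),
}

candidates = level_one(samples, rules)
ranked = level_two(candidates, samples, rules)
for name, score in ranked:
    print(f"{name}\t{score:.2f}")
```

In this sketch an analyst would work down the ranked list from the top, or apply a score cut-off to trade recall for precision; how the actual HAC system scores and orders files is not specified in the abstract.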