Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list

  • Authors:
  • Yanfang Ye;Tao Li;Kai Huang;Qingshan Jiang;Yong Chen

  • Affiliations:
  • Department of Computer Science, Xiamen University, Xiamen, People's Republic of China 361005;School of Computer Science, Florida International University, Miami, USA 33199;Software School, Xiamen University, Xiamen, People's Republic of China 361005;Software School, Xiamen University, Xiamen, People's Republic of China 361005;Anti-virus Laboratory, Kingsoft Corporation, Zhuhai, People's Republic of China 519000

  • Venue:
  • Journal of Intelligent Information Systems
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Nowadays, numerous attacks made by the malware (e.g., viruses, backdoors, spyware, trojans and worms) have presented a major security threat to computer users. Currently, the most significant line of defense against malware is anti-virus products which focus on authenticating valid software from a whitelist, blocking invalid software from a blacklist, and running any unknown software (i.e., the gray list) in a controlled manner. The gray list, containing unknown software programs which could be either normal or malicious, is usually authenticated or rejected manually by virus analysts. Unfortunately, along with the development of the malware writing techniques, the number of file samples in the gray list that need to be analyzed by virus analysts on a daily basis is constantly increasing. The gray list is not only large in size, but also has an imbalanced class distribution where malware is the minority class. In this paper, we describe our research effort on building automatic, effective, and interpretable classifiers resting on the analysis of Application Programming Interfaces (APIs) called by Windows Portable Executable (PE) files for detecting malware from the large and imbalanced gray list. Our effort is based on associative classifiers due to their high interpretability as well as their capability of discovering interesting relationships among API calls. We first adapt several different post-processing techniques of associative classification, including rule pruning and rule re-ordering, for building effective associative classifiers from large collections of training data. In order to help the virus analysts detect malware from the imbalanced gray list, we then develop the Hierarchical Associative Classifier (HAC). HAC constructs a two-level associative classifier to maximize precision and recall of the minority (malware) class: in the first level, it uses high precision rules of majority (benign file samples) class and low precision rules of minority class to achieve high recall; and in the second level, it ranks the minority class files and optimizes the precision. Finally, since our case studies are based on a large and real data collection obtained from the Anti-virus Lab of Kingsoft corporation, including 8,000,000 malware, 8,000,000 benign files, and 100,000 file samples from the gray list, we empirically examine the sampling strategy to build the classifiers for such a large data collection to avoid over-fitting and achieve great effectiveness as well as high efficiency. Promising experimental results demonstrate the effectiveness and efficiency of the HAC classifier. HAC has already been incorporated into the scanning tool of Kingsoft's Anti-Virus software.