Malware detection using statistical analysis of byte-level file content

  • Authors:
  • S. Momina Tabish;M. Zubair Shafiq;Muddassar Farooq

  • Affiliations:
  • National University of Computer & Emerging Sciences (FAST-NUCES), Islamabad, Pakistan;National University of Computer & Emerging Sciences (FAST-NUCES), Islamabad, Pakistan;National University of Computer & Emerging Sciences (FAST-NUCES), Islamabad, Pakistan

  • Venue:
  • Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics
  • Year:
  • 2009

Abstract

Commercial anti-virus software is unable to provide protection against newly launched (a.k.a. "zero-day") malware. In this paper, we propose a novel malware detection technique based on the analysis of byte-level file content. The novelty of our approach, compared with existing content-based mining schemes, is that it does not memorize specific byte sequences or strings appearing in the actual file content. Our technique is non-signature-based and therefore has the potential to detect previously unknown and zero-day malware. We compute a wide range of statistical and information-theoretic features in a block-wise manner to quantify the byte-level file content. We leverage standard data mining algorithms to classify the file content of every block as normal or potentially malicious. Finally, we correlate the block-wise classification results of a given file to categorize it as benign or malware. Because the proposed scheme operates on byte-level file content, it does not require any a priori information about the filetype. We have tested our proposed technique using a benign dataset comprising six different filetypes (DOC, EXE, JPG, MP3, PDF and ZIP) and a malware dataset comprising six different malware types (backdoor, trojan, virus, worm, constructor and miscellaneous). We also perform a comparison with existing data-mining-based malware detection techniques. The results of our experiments show that the proposed non-signature-based technique surpasses the existing techniques and achieves more than 90% detection accuracy.
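
The block-wise pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the features, the block size, or the rule used to correlate block results. The sketch assumes three simple per-block statistics (Shannon entropy, mean byte value, number of distinct byte values), a hypothetical block size of 1024 bytes, and a hypothetical majority-style threshold for the file-level verdict. The per-block classifier itself is left out; any standard classifier could be trained on the feature dictionaries.

```python
import math
from collections import Counter


def block_entropy(block: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    n = len(block)
    counts = Counter(block)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def block_features(data: bytes, block_size: int = 1024) -> list:
    """Split data into fixed-size blocks and compute simple statistics per block.

    The block size and the feature set are illustrative assumptions,
    not values taken from the paper.
    """
    features = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        features.append({
            "entropy": block_entropy(block),
            "mean": sum(block) / len(block),
            "distinct": len(set(block)),
        })
    return features


def file_verdict(block_labels: list, threshold: float = 0.5) -> bool:
    """Flag a file when the fraction of suspicious blocks exceeds the threshold.

    block_labels holds one boolean per block (True = potentially malicious).
    The thresholding rule is an assumed stand-in for the correlation step.
    """
    if not block_labels:
        return False
    return sum(block_labels) / len(block_labels) > threshold
```

Because the features are computed directly on raw bytes, the same functions apply to any file regardless of its type, which matches the abstract's point that no a priori filetype information is needed.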