Comparison of feature selection and classification algorithms in identifying malicious executables

  • Authors:
  • D. Michael Cai; Maya Gokhale; James Theiler

  • Affiliations:
  • Space Data Systems Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; Advanced Computing Laboratory, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; Space and Remote Sensing Sciences Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

  • Venue:
  • Computational Statistics & Data Analysis
  • Year:
  • 2007

Abstract

Malicious executables, often spread as email attachments, pose serious security threats to computer systems and associated networks. We investigated the use of byte sequence frequencies as a way to automatically distinguish malicious from benign executables without actually executing them. In a series of experiments, we compared classification accuracies over seven feature selection methods, four classification algorithms, and varying byte sequence lengths. We found that single-byte patterns provided surprisingly reliable features to separate malicious executables from benign ones. Overall model performance depended more on the choice of classifier than on the method of feature selection. Support vector machine (SVM) classifiers were found to be superior in terms of prediction accuracy, training time, and resistance to overfitting.
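
As an illustration of what a byte sequence frequency feature might look like, the following is a minimal sketch and not the authors' implementation. The function names `byte_ngram_frequencies` and `single_byte_vector`, the use of overlapping windows, and the normalization to relative frequencies are all assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter


def byte_ngram_frequencies(data: bytes, n: int = 1) -> dict:
    """Relative frequency of each length-n byte sequence in `data`.

    Hypothetical helper: overlapping windows and normalization by the
    number of windows are assumptions, not details from the paper.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    total = len(data) - n + 1  # number of overlapping windows
    if total <= 0:
        return {}
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def single_byte_vector(data: bytes) -> list:
    """Fixed-length (256-entry) vector of single-byte relative frequencies."""
    freqs = byte_ngram_frequencies(data, 1)
    return [freqs.get(bytes([b]), 0.0) for b in range(256)]
```

A vector such as `single_byte_vector(open(path, "rb").read())` is computed from the raw file contents without executing the file, and could then be passed to a feature selection step and a classifier such as an SVM.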