A study of detecting computer viruses in real-infected files in the n-gram representation with machine learning methods

Authors:
Thomas Stibor
Affiliations:
Fakultät für Informatik, Technische Universität München
Venue:
IEA/AIE'10 Proceedings of the 23rd international conference on Industrial engineering and other applications of applied intelligent systems - Volume Part I
Year:
2010

Citing 8
Cited 0

A training algorithm for optimal margin classifiers

COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
Support-Vector Networks

Machine Learning
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
Data Mining Methods for Detection of New Malicious Executables

SP '01 Proceedings of the 2001 IEEE Symposium on Security and Privacy
Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)

Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing)
Comparison of feature selection and classification algorithms in identifying malicious executables

Computational Statistics & Data Analysis
Learning to Detect and Classify Malicious Executables in the Wild

The Journal of Machine Learning Research
Introduction to Information Retrieval

Introduction to Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Machine learning methods were successfully applied in recent years for detecting new and unseen computer viruses. The viruses were, however, detected in small virus loader files and not in real infected executable files. We created data sets of benign files, virus loader files and real infected executable files and represented the data as collections of n-grams. Our results indicate that detecting viruses in real infected executable files with machine learning methods is nearly impossible in the n-gram representation. This statement is underpinned by exploring the n-gram representation from an information theoretic perspective and empirically by performing classification experiments with machine learning methods.