GP-Fileprints: file types detection using genetic programming

  • Authors:
  • Ahmed Kattan; Edgar Galván-López; Riccardo Poli; Michael O'Neill

  • Affiliations:
  • School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK; Natural Computing Research & Applications Group, University College Dublin, Ireland; School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK; Natural Computing Research & Applications Group, University College Dublin, Ireland

  • Venue:
  • EuroGP'10 Proceedings of the 13th European conference on Genetic Programming
  • Year:
  • 2010

Abstract

We propose a novel application of Genetic Programming (GP): the identification of file types via the analysis of raw binary streams (i.e., without the use of metadata). GP evolves programs with multiple components. One component analyses statistical features extracted from the raw byte series to divide the data into blocks. These blocks are then analysed by another component to obtain a signature for each file in a training set. These signatures are then projected onto a two-dimensional Euclidean space via two further (evolved) program components. K-means clustering is applied to group similar signatures. Each cluster is then labelled according to the dominant label among its members. Once a program that achieves good classification has been evolved, it can be used on unseen data without requiring any further evolution. Experimental results show that GP compares very well with established file classification algorithms (i.e., Neural Networks, Bayes Networks and J48 Decision Trees).
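The final two stages described in the abstract (clustering the two-dimensional signatures and labelling each cluster by its dominant file type) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the evolved GP components are not reproduced, so hand-written (x, y) points stand in for the projected signatures, and the file-type labels, the function names (`kmeans`, `label_clusters`, `classify`) and the choice of the first k points as initial centroids are all assumptions made for illustration.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (assumption)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def label_clusters(points, labels, centroids):
    """Give each cluster the most frequent label among its training members."""
    votes = [Counter() for _ in centroids]
    for p, lab in zip(points, labels):
        votes[nearest(p, centroids)][lab] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


def classify(p, centroids, cluster_labels):
    """Classify an unseen signature by the label of its nearest cluster."""
    return cluster_labels[nearest(p, centroids)]


# Hypothetical 2-D signatures standing in for the evolved projections.
train_points = [(0.1, 0.2), (5.0, 5.1), (0.2, 0.1), (4.9, 5.2), (0.0, 0.3), (5.2, 4.8)]
train_labels = ["pdf", "jpg", "pdf", "jpg", "pdf", "jpg"]  # illustrative labels

centroids = kmeans(train_points, k=2)
cluster_labels = label_clusters(train_points, train_labels, centroids)

print(cluster_labels)
print(classify((0.3, 0.0), centroids, cluster_labels))
print(classify((4.5, 5.5), centroids, cluster_labels))
```

Once the centroids and cluster labels are fixed, classifying an unseen point needs only a nearest-centroid lookup, which mirrors the abstract's remark that an evolved program can be applied to new data without further evolution.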