A comparative performance study of feature selection methods for the anti-spam filtering domain

  • Authors:
  • J. R. Méndez; F. Fdez-Riverola; F. Díaz; E. L. Iglesias; J. M. Corchado

  • Affiliations:
  • Dept. Informática, University of Vigo, Ourense, Spain; Dept. Informática, University of Vigo, Ourense, Spain; Dept. Informática, University of Valladolid, Segovia, Spain; Dept. Informática, University of Vigo, Ourense, Spain; Dept. Informática y Automática, University of Salamanca, Salamanca, Spain

  • Venue:
  • ICDM'06 Proceedings of the 6th Industrial Conference on Data Mining conference on Advances in Data Mining: applications in Medicine, Web Mining, Marketing, Image and Signal Mining
  • Year:
  • 2006

Abstract

In this paper we analyse the strengths and weaknesses of the feature selection methods most widely used in text categorization when they are applied to the spam filtering domain. Several experiments combining different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ²-test, Mutual Information and Document Frequency feature selection methods are analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From these experiments, the underlying ideas behind the feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.