2005 Special Issue: Efficient information theoretic strategies for classifier combination, feature extraction and performance evaluation in improving false positives and false negatives for spam e-mail filtering

  • Authors:
  • V. Zorkadis; D. A. Karras; M. Panayotou

  • Affiliations:
  • Data Protection Authority and Hellenic Open University, Athens, Greece; Dept. Automation and Hellenic Open University, Chalkis Institute of Technology, Rodu 2, Ano Iliupolis, Athens 16342, Greece; Hellenic Open University, Athens, Greece

  • Venue:
  • Neural Networks - 2005 Special issue: IJCNN 2005
  • Year:
  • 2005

Abstract

Spam e-mails are considered a serious privacy-related violation, besides being a costly, unsolicited communication. Various spam filtering techniques have been proposed so far, mainly based on Naive Bayesian algorithms. Other Machine Learning algorithms, such as Boosting trees or Support Vector Machines (SVM), have already been used with success. However, the number of False Positives (FP) and False Negatives (FN) resulting from applying various spam e-mail filters still remains too high, and the problem of spam e-mail categorization cannot be considered completely solved from a practical viewpoint. In this paper, we propose a novel approach for spam e-mail filtering based on efficient information theoretic techniques for integrating classifiers, for extracting improved features and for properly evaluating categorization accuracy in terms of FP and FN. The goal of the presented methodology is to empirically but explicitly minimize these FP and FN numbers by combining high-performance FP filters with high-performance FN filters emerging from a previous work of the authors [Zorkadis, V., Panayotou, M., & Karras, D. A. (2005). Improved spam e-mail filtering based on committee machines and information theoretic feature extraction. Proceedings of the International Joint Conference on Neural Networks, July 31-August 4, 2005, Montreal, Canada]. To this end, Random Committee-based filters and ADTree-based filters are efficiently combined through information theory. The experiments conducted are among the most extensive so far in the literature, exploiting widely accepted benchmarking e-mail data sets and comparing the proposed methodology with the Naive Bayes spam filter as well as with the Boosting tree methodology, classification via regression and other machine learning models. Novel information theoretic measures of FP and FN filtering performance show that the proposed approach compares very favorably with the rival methods. Finally, it is found that the proposed information theoretic Boolean features yield remarkably high spam categorization performance.
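
The abstract does not give the authors' formulas, so the following is only an illustrative sketch of two generic building blocks it alludes to: scoring a Boolean feature by information gain, and counting FP and FN for a filter. The function names, the toy data and the choice of information gain as the score are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Information gain of a Boolean feature with respect to the class labels.

    Generic textbook definition, assumed here; not the paper's own measure.
    """
    n = len(labels)
    present = [y for x, y in zip(feature, labels) if x]
    absent = [y for x, y in zip(feature, labels) if not x]
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def fp_fn_counts(predicted, actual, positive="spam"):
    """Count false positives (legitimate mail flagged as spam) and
    false negatives (spam that passes the filter)."""
    fp = sum(1 for p, a in zip(predicted, actual)
             if p == positive and a != positive)
    fn = sum(1 for p, a in zip(predicted, actual)
             if p != positive and a == positive)
    return fp, fn


# Hypothetical toy data: four messages and two Boolean word features.
labels = ["spam", "spam", "ham", "ham"]
has_free = [True, True, False, False]    # perfectly separates the classes
has_hello = [True, False, True, False]   # carries no class information

print(information_gain(has_free, labels))
print(information_gain(has_hello, labels))
print(fp_fn_counts(["spam", "ham", "spam", "ham"], labels))
```

In this toy setting a feature that splits the classes perfectly receives the maximum gain and an uninformative one receives zero, which is the kind of ranking a feature extraction step could use to select Boolean features.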