Concentration based feature construction approach for spam detection

  • Authors:
  • Ying Tan;Chao Deng;Guangchen Ruan

  • Affiliations:
  • Key Laboratory of Machine Perception and Intelligence and Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, MOE, Beijing, P. R. China;Key Laboratory of Machine Perception and Intelligence and Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, MOE, Beijing, P. R. China;Key Laboratory of Machine Perception and Intelligence and Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, MOE, Beijing, P. R. China

  • Venue:
  • IJCNN'09: Proceedings of the 2009 International Joint Conference on Neural Networks
  • Year:
  • 2009

Abstract

Inspired by the human immune system, this paper proposes a concentration based feature construction (CFC) approach for spam detection that uses a two-element concentration vector as the feature vector. In the CFC approach, 'self' and 'non-self' concentrations are constructed using 'self' and 'non-self' gene libraries, respectively, and are then combined into a two-element concentration vector that characterizes the e-mail efficiently. As a result, designing the classifier amounts to establishing a mapping from two real-valued inputs to one binary output. The classification of e-mail is treated as an optimization problem that minimizes a formulated cost function, and a clonal particle swarm optimization (CPSO) algorithm proposed by the lead author is employed for this purpose. Several classifiers, including a linear discriminant, multi-layer neural networks, and a support vector machine, are used to verify the effectiveness and robustness of the CFC approach. Experimental results demonstrate that the proposed CFC approach is not only very fast but also achieves accuracies of 97% and 99% on the PU1 and Ling corpora, respectively, using only a two-element concentration feature vector.
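
The abstract does not give the exact formula for the concentrations, so the following Python sketch is only an illustration of the general idea under stated assumptions. It assumes that each gene library is a set of words and that a concentration is the fraction of an e-mail's distinct tokens found in that library. The function names, the whitespace tokenizer, the toy libraries, and the hand-picked linear weights are all hypothetical; in the paper the classifier parameters are obtained by optimization (for example with CPSO), not set by hand.

```python
from typing import Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Return the distinct lower-cased, whitespace-separated tokens (assumed tokenizer)."""
    return set(text.lower().split())


def concentration(tokens: Set[str], gene_library: Set[str]) -> float:
    """Fraction of distinct tokens that appear in the gene library (assumed definition)."""
    if not tokens:
        return 0.0
    return len(tokens & gene_library) / len(tokens)


def cfc_features(text: str, self_genes: Set[str], nonself_genes: Set[str]) -> Tuple[float, float]:
    """Build the two-element vector (self concentration, non-self concentration)."""
    tokens = tokenize(text)
    return concentration(tokens, self_genes), concentration(tokens, nonself_genes)


def linear_classify(features: Tuple[float, float],
                    w_self: float = -1.0,
                    w_nonself: float = 1.0,
                    bias: float = 0.0) -> int:
    """Map two real-valued inputs to one binary output: 1 = spam, 0 = legitimate.

    The weights here are illustrative placeholders, not learned values.
    """
    self_conc, nonself_conc = features
    score = w_self * self_conc + w_nonself * nonself_conc + bias
    return 1 if score > 0 else 0


# Toy gene libraries (hypothetical, for illustration only).
SELF_GENES = {"meeting", "report", "thanks"}
NONSELF_GENES = {"free", "winner", "prize"}

spam_features = cfc_features("Free prize winner claim now", SELF_GENES, NONSELF_GENES)
ham_features = cfc_features("thanks for the meeting report", SELF_GENES, NONSELF_GENES)

print(spam_features, linear_classify(spam_features))
print(ham_features, linear_classify(ham_features))
```

Because every e-mail is reduced to two numbers, any classifier mentioned in the abstract (linear discriminant, neural network, or support vector machine) could in principle replace the hypothetical `linear_classify` above without changing the feature construction step.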