Using GMDH-based networks for improved spam detection and email feature analysis

Authors:
El-Sayed M. El-Alfy;Radwan E. Abdel-Aal
Affiliations:
College of Computer Sciences and Engineering, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia;College of Computer Sciences and Engineering, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
Venue:
Applied Soft Computing
Year:
2011




Abstract

Unsolicited email, or spam, has recently become a major threat to the usability of electronic mail. Spam wastes substantial time and money for business users and network administrators, consumes network bandwidth and storage space, and slows down email servers. It also provides a medium for distributing harmful code and offensive content. In this paper, we explore the application of the GMDH (Group Method of Data Handling) based inductive learning approach to detecting spam messages by automatically identifying content features that effectively distinguish spam from legitimate emails. We study performance at various levels of network model complexity using spambase, a publicly available benchmark dataset. Results show that a classification accuracy of 91.7% can be achieved using only 10 of the 57 available attributes, selected through abductive learning as the most effective feature subset (i.e. an 82.5% data reduction). We also show how to improve classification performance using abductive network ensembles (committees) trained on different subsets of the training data. Comparison with other techniques, such as neural networks and naive Bayesian classifiers, shows that the GMDH-based learning approach can provide better spam detection accuracy, with false-positive rates as low as 4.3%, while requiring shorter training time.
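
To make the feature-selection idea in the abstract concrete, the following is a minimal sketch of a single GMDH-style layer. It is not the authors' implementation. It assumes the classic formulation in which each candidate unit is a quadratic polynomial of two inputs, fitted by least squares on a training split and ranked by error on a held-out split. The function names (`quad_design`, `fit_gmdh_layer`), the `keep` parameter, and the synthetic data are all hypothetical and used for illustration only; the real spambase attributes are not used here.

```python
import itertools
import numpy as np

def quad_design(a, b):
    """Quadratic polynomial terms for one pair of inputs (assumed unit form)."""
    return np.column_stack([np.ones_like(a), a, b, a * b, a ** 2, b ** 2])

def fit_gmdh_layer(X_train, y_train, X_val, y_val, keep=3):
    """Fit one quadratic unit per input pair and keep the `keep` units
    with the lowest mean squared error on the validation split."""
    units = []
    for i, j in itertools.combinations(range(X_train.shape[1]), 2):
        A = quad_design(X_train[:, i], X_train[:, j])
        coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        pred = quad_design(X_val[:, i], X_val[:, j]) @ coef
        mse = float(np.mean((pred - y_val) ** 2))
        units.append((mse, (i, j), coef))
    units.sort(key=lambda u: u[0])
    return units[:keep]

# Synthetic stand-in data: 5 features, label depends only on features 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(float)  # 1.0 = "spam" in this toy setup

X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

top = fit_gmdh_layer(X_train, y_train, X_val, y_val, keep=3)

# Features used by the surviving units form the selected subset.
selected = sorted({k for _, pair, _ in top for k in pair})
for mse, pair, _ in top:
    print(f"inputs {pair}: validation MSE = {mse:.3f}")
print("selected features:", selected)
```

In a full GMDH network the outputs of the surviving units would feed a further layer, and growth would stop once validation error no longer improves; inputs that never appear in a surviving unit are effectively discarded, which is the sense in which a feature subset is selected automatically.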