Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails

Authors:
Shrawan Kumar Trivedi;Shubhamoy Dey
Affiliations:
Indian Institute of Management, Prabandh Shikhar, Rau, Indore, India;Indian Institute of Management, Prabandh Shikhar, Rau, Indore, India
Venue:
ACM SIGAPP Applied Computing Review
Year:
2014

Citing 11
Cited 0

Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Text Categorization with Support Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Identifying Junk Electronic Mail in Microsoft Outlook with a Support Vector Machine

SAINT '03 Proceedings of the 2003 Symposium on Applications and the Internet
Spam and the ongoing battle for the inbox

Communications of the ACM - Spam and the ongoing battle for the inbox
An empirical study of three machine learning methods for spam filtering

Knowledge-Based Systems
Relaxed online SVMs for spam filtering

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to this special issue on revisiting and reinventing e-mail

Human-Computer Interaction
An overview of statistical learning theory

IEEE Transactions on Neural Networks
Support vector machines for spam categorization

IEEE Transactions on Neural Networks
Effect of feature selection methods on machine learning classifiers for detecting email spams

Proceedings of the 2013 Research in Adaptive and Convergent Systems
An Enhanced Genetic Programming Approach for Detecting Unsolicited Emails

CSE '13 Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering

Abstract

Detecting spam emails within a set of email files has become a challenging task for researchers. An effective classifier is identified not only by high detection accuracy but also by a low false alarm rate and by the need to use as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (Genetic, Greedy Stepwise, Best First, and Rank Search) on popular machine learning classifiers such as Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48, and Random Forest. Tests were performed on three publicly available spam email datasets: "Enron", "SpamAssassin", and "LingSpam". The results show that Greedy Stepwise Search is a good method for feature subset selection in spam email detection. Among the machine learning classifiers, Support Vector Machine was found to be the best in terms of both accuracy and false positive rate. However, the results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.
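
The two evaluation criteria named in the abstract (accuracy and false positive rate) and the general idea of a greedy stepwise search can be illustrated with a minimal Python sketch. This is not the authors' implementation or the tool they used: it shows a simplified forward-only greedy search, and the function names, the `min_gain` stopping threshold, and the toy scoring function are all assumptions introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy_and_fp_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, false positive rate); labels are 1 = spam, 0 = legitimate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # False positive rate: share of legitimate emails wrongly flagged as spam.
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fp_rate


def greedy_forward_selection(
    n_features: int,
    score: Callable[[List[int]], float],
    min_gain: float = 1e-9,  # assumed stopping threshold, not taken from the paper
) -> Tuple[List[int], float]:
    """Add one feature at a time, keeping the one that most improves `score`.

    `score` is a hypothetical callback that evaluates a feature subset,
    for example by cross-validating a classifier restricted to those features.
    """
    selected: List[int] = []
    best = score(selected)
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        top_score, top_feature = max((score(selected + [f]), f) for f in candidates)
        if top_score - best <= min_gain:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        best = top_score
    return selected, best


# Toy scoring function (an assumption): features 0 and 2 are informative,
# and every selected feature carries a small cost.
WEIGHTS = [0.3, 0.0, 0.2, 0.0]


def toy_score(subset: List[int]) -> float:
    return 0.5 + sum(WEIGHTS[f] for f in subset) - 0.01 * len(subset)


if __name__ == "__main__":
    print(greedy_forward_selection(len(WEIGHTS), toy_score))
    print(accuracy_and_fp_rate([1, 0, 0, 1, 0], [1, 1, 0, 0, 0]))
```

In practice the scoring callback would wrap whichever classifier is being studied, so the same search loop can be reused when comparing classifiers on the selected feature subsets.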