Identifying spam e-mail based-on statistical header features and sender behavior

Authors:
Aziz Qaroush;Ismail M. Khater;Mahdi Washaha
Affiliations:
Birzeit University, Birzeit, West Bank, Palestine;Birzeit University, Birzeit, West Bank, Palestine;Birzeit University, Birzeit, West Bank, Palestine
Venue:
Proceedings of the CUBE International Information Technology Conference
Year:
2012

Abstract

Email spam filtering remains a sophisticated and challenging problem as long as spammers keep developing new methods and techniques for their campaigns to defeat and confuse the filtering process. Moreover, relying on email header information imposes additional challenges in classifying emails because header information can be easily spoofed by spammers. In recent years, spam has also become a major problem at the social, economic, political, and organizational levels because it decreases employee productivity and causes network traffic congestion. In this paper, we present a set of powerful and useful email header features obtained from the header session messages of publicly available datasets. We then apply several machine learning classifiers to the extracted header features and demonstrate their power in separating spam from ham messages by evaluating and comparing classifier performance. In the experiment stage, we apply the following classifiers: Random Forest (RF), C4.5 Decision Tree (J48), Voting Feature Intervals (VFI), Random Tree (RT), REPTree (REPT), Bayesian Network (BN), and Naïve Bayes (NB). The experimental results show that the RF classifier performs best, with accuracy, precision, recall, and F-measure of 99.27%, 99.40%, 99.50%, and 99.50%, respectively, when all of the mentioned features are used, including the trust feature.
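
The four evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions and treats spam as the positive class; that choice, the `evaluate` helper, the `Metrics` container, and the toy labels are all assumptions made for illustration, not the authors' evaluation code. The paper may well report class-weighted averages instead.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f_measure: float


def evaluate(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F-measure for one positive class.

    Hypothetical helper: treating "spam" as the positive class is an
    assumption of this sketch, not something stated in the abstract.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one labelled message")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (e.g. no message predicted as spam).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return Metrics(accuracy, precision, recall, f_measure)


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not data from the paper).
    truth = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
    guess = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]
    print(evaluate(truth, guess))
```

Comparing classifiers such as RF, J48, or NB then amounts to calling `evaluate` once per classifier on the same held-out messages and comparing the resulting `Metrics` values.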