A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise

Authors:
Tu Ouyang; Soumya Ray; Mark Allman; Michael Rabinovich
Venue:
Computer Networks: The International Journal of Computer and Telecommunications Networking
Year:
2014

Abstract

Spam is a never-ending issue that constantly consumes resources to no useful end. In this paper, we envision spam filtering as a pipeline consisting of DNS blacklists, filters based on SYN packet features, filters based on traffic characteristics, and filters based on message content. Each successive stage of the pipeline examines more information about the message but is more computationally expensive. A message is rejected as spam once any layer is sufficiently confident. We analyze this pipeline, focusing on the first three layers, from a single-enterprise perspective. To do this, we use a large email dataset collected over two years. We devise a novel ground truth determination system that allows us to label this large dataset accurately. Using two machine learning algorithms, we study (i) how the different pipeline layers interact with each other and the value added by each layer, (ii) the utility of individual features in each layer, (iii) the stability of the layers across time and network events, and (iv) an operational use case investigating whether this architecture can be practically useful. We find that (i) the pipeline architecture is generally useful in terms of accuracy as well as in an operational setting, (ii) it generally ages gracefully across long time periods, and (iii) in some cases, later layers can compensate for poor performance in the earlier layers. Among the caveats we find are that (i) the utility of network features is not as high from the single-enterprise viewpoint as reported in prior work, (ii) major network events can sharply affect the detection rate, and (iii) the operational (computational) benefit of the pipeline may depend on the efficiency of the final content filter.
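
The early-exit behavior of the pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Layer` type, the `run_pipeline` function, the feature names (`ip`, `syn_window`, `retransmits`, `body`), the scoring rules, and the thresholds are all hypothetical stand-ins chosen only to show how a message passes through increasingly expensive layers and is rejected at the first layer that is sufficiently confident.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A scorer maps a message (a plain dict of hypothetical features)
# to an estimated spam confidence in [0, 1].
Scorer = Callable[[dict], float]


@dataclass
class Layer:
    name: str
    score: Scorer
    threshold: float  # assumed rule: reject when score >= threshold


def run_pipeline(message: dict, layers: List[Layer]) -> Tuple[str, Optional[str]]:
    """Run layers in order; stop at the first one that is confident enough.

    Returns ("reject", layer_name) or ("accept", None).
    """
    for layer in layers:
        if layer.score(message) >= layer.threshold:
            return "reject", layer.name
    return "accept", None


# Toy scorers, cheapest first. All rules below are illustrative assumptions.
BLACKLISTED_IPS = {"192.0.2.1"}  # documentation-range address, not real data


def dnsbl_score(msg: dict) -> float:
    return 1.0 if msg.get("ip") in BLACKLISTED_IPS else 0.0


def syn_score(msg: dict) -> float:
    return 0.9 if msg.get("syn_window", 65535) < 1024 else 0.1


def traffic_score(msg: dict) -> float:
    return 0.8 if msg.get("retransmits", 0) > 5 else 0.2


def content_score(msg: dict) -> float:
    return 0.95 if "free money" in msg.get("body", "").lower() else 0.05


LAYERS = [
    Layer("dnsbl", dnsbl_score, 0.5),
    Layer("syn", syn_score, 0.85),
    Layer("traffic", traffic_score, 0.75),
    Layer("content", content_score, 0.9),
]

if __name__ == "__main__":
    print(run_pipeline({"ip": "192.0.2.1"}, LAYERS))
    print(run_pipeline({"ip": "198.51.100.7", "body": "Meeting at noon"}, LAYERS))
```

In this sketch, a message from a blacklisted address never reaches the later layers, which is where any computational saving would come from; a message that no layer flags falls through every stage and is accepted.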