Spam is a never-ending problem that constantly consumes resources to no useful end. In this paper, we envision spam filtering as a pipeline consisting of DNS blacklists, filters based on SYN packet features, filters based on traffic characteristics, and filters based on message content. Each successive stage of the pipeline examines more information in the message but is more computationally expensive. A message is rejected as spam as soon as any layer is sufficiently confident. We analyze this pipeline, focusing on the first three layers, from a single-enterprise perspective, using a large email dataset collected over two years. We devise a novel ground-truth determination system that allows us to label this large dataset accurately. Using two machine learning algorithms, we study (i) how the different pipeline layers interact with each other and the value added by each layer, (ii) the utility of individual features in each layer, (iii) the stability of the layers across time and network events, and (iv) an operational use case investigating whether this architecture can be practically useful. We find that (i) the pipeline architecture is generally useful in terms of accuracy as well as in an operational setting, (ii) it generally ages gracefully over long time periods, and (iii) in some cases, later layers can compensate for poor performance in earlier layers. Among the caveats, we find that (i) the utility of network features is not as high from the single-enterprise viewpoint as reported in prior work, (ii) major network events can sharply affect the detection rate, and (iii) the operational (computational) benefit of the pipeline may depend on the efficiency of the final content filter.
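The early-exit behavior of the pipeline described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Stage` class, the `classify` function, the toy scoring functions, the message fields, and all thresholds are hypothetical placeholders chosen for illustration. In the paper, the per-layer decisions come from trained machine learning models rather than hand-written rules.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Message = Dict[str, Any]  # hypothetical message representation


@dataclass
class Stage:
    name: str
    score: Callable[[Message], float]  # estimated spam confidence in [0, 1]
    threshold: float                   # reject when score >= threshold


def classify(message: Message, stages: List[Stage]) -> Tuple[str, Optional[str]]:
    """Run stages in order (cheapest first) and stop at the first confident rejection."""
    for stage in stages:
        if stage.score(message) >= stage.threshold:
            return "spam", stage.name  # later, costlier stages are never run
    return "ham", None


# Toy stand-ins for the four layers; every rule and field name is an assumption.
BLACKLISTED_IPS = {"203.0.113.7"}


def dnsbl_score(m: Message) -> float:
    return 1.0 if m.get("ip") in BLACKLISTED_IPS else 0.0


def syn_score(m: Message) -> float:
    return 0.9 if m.get("ttl", 64) < 32 else 0.1


def traffic_score(m: Message) -> float:
    return 0.8 if m.get("retransmits", 0) > 5 else 0.2


def content_score(m: Message) -> float:
    return 0.95 if "free money" in str(m.get("body", "")).lower() else 0.05


PIPELINE = [
    Stage("dnsbl", dnsbl_score, 0.5),
    Stage("syn", syn_score, 0.85),
    Stage("traffic", traffic_score, 0.75),
    Stage("content", content_score, 0.9),
]

if __name__ == "__main__":
    print(classify({"ip": "203.0.113.7"}, PIPELINE))
    print(classify({"ip": "198.51.100.1", "body": "Meeting at noon"}, PIPELINE))
```

Because a message leaves the pipeline at the first sufficiently confident layer, the expensive content filter only runs on messages that the cheaper layers could not reject, which is the source of the computational benefit discussed in the abstract.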