Enhancing scalability in anomaly-based email spam filtering

Authors:
Carlos Laorden;Xabier Ugarte-Pedrero;Igor Santos;Borja Sanz;Pablo G. Bringas
Affiliations:
University of Deusto, Bilbao, Spain;University of Deusto, Bilbao, Spain;University of Deusto, Bilbao, Spain;University of Deusto, Bilbao, Spain;University of Deusto, Bilbao, Spain
Venue:
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
Year:
2011

Abstract

Spam has become an important problem for computer security because it is a channel for spreading threats such as computer viruses, worms and phishing. Currently, more than 85% of received emails are spam. Historical approaches to combating these messages, including simple techniques such as sender blacklisting or the use of email signatures, are no longer completely reliable. Many solutions use machine-learning approaches trained on statistical representations of the terms that usually appear in emails. However, these methods require a time-consuming training step with labelled data. Limited availability of labelled training instances slows down the progress of filtering systems and gives spammers an advantage. In previous work, we presented the first spam filtering method based on anomaly detection, which reduces the need to label spam messages and employs only the representation of legitimate emails. We showed that this method achieved high accuracy in detecting spam while maintaining a low false positive rate and reducing the effort of labelling spam. In this paper, we enhance that system by applying a data reduction algorithm to the labelled dataset, finding similarities among legitimate emails and grouping them into consistent clusters that reduce the number of comparisons needed. We show that this improvement drastically reduces processing time while keeping detection and false positive rates stable.
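
The idea of comparing an incoming message against cluster representatives, rather than against every legitimate email, can be illustrated with a minimal sketch. The abstract does not specify the email representation, the distance measure, or the data reduction algorithm, so everything below is an assumption made for illustration: term-frequency vectors, cosine distance, a greedy "leader" clustering, the function names, and the `radius` and `threshold` values. It is not the authors' implementation.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-frequency vector of a message (assumed representation)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def reduce_to_centroids(vectors, radius):
    """Greedy leader clustering (an assumed stand-in for the data
    reduction step): a vector joins the first cluster whose centroid
    lies within `radius`, otherwise it starts a new cluster."""
    clusters = []  # each entry: (sum of member vectors, member count)
    for v in vectors:
        for i, (total, n) in enumerate(clusters):
            centroid = {t: c / n for t, c in total.items()}
            if cosine_distance(v, centroid) <= radius:
                clusters[i] = (total + v, n + 1)
                break
        else:
            clusters.append((Counter(v), 1))
    return [{t: c / n for t, c in total.items()} for total, n in clusters]


def is_anomalous(text, centroids, threshold):
    """Flag a message as a deviation from legitimate mail when it is
    farther than `threshold` from every legitimate-mail centroid."""
    v = vectorize(text)
    return min(cosine_distance(v, c) for c in centroids) > threshold


# Hypothetical legitimate messages; only legitimate mail is needed.
legit = [
    "meeting agenda for project review",
    "project review meeting notes",
    "lunch on friday with the team",
    "team lunch friday",
]
centroids = reduce_to_centroids([vectorize(t) for t in legit], radius=0.6)
```

In this toy setting, four legitimate messages collapse into two centroids, so each incoming message needs two distance computations instead of four. The saving grows with the size of the legitimate corpus, at the cost of a coarser model of normal mail.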