Fast statistical spam filter by approximate classifications

Authors:
Kang Li;Zhenyu Zhong
Affiliations:
University of Georgia, Athens, Georgia;University of Georgia, Athens, Georgia
Venue:
SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Year:
2006

Abstract

Statistical Bayesian filters have become a popular and important defense against spam. Despite their effectiveness, however, their higher processing overhead can prevent them from scaling well to enterprise-level mail servers. For example, the dictionary lookups that are characteristic of this approach are limited by the memory access rate and are therefore relatively insensitive to increases in CPU speed. We address this scaling issue by proposing an acceleration technique that speeds up Bayesian filters through approximate classification. The approximation uses two methods: hash-based lookup and lossy encoding. Lookup approximation is based on the popular Bloom filter data structure, extended to support value retrieval. Lossy encoding is used to further compress the data structure. While both methods introduce additional errors relative to a strict Bayesian approach, we show how the errors can be both minimized and biased toward false negative classifications. We demonstrate a 6x speedup over two well-known spam filters (bogofilter and qsf) while achieving an identical false positive rate and a similar false negative rate to the original filters.
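
The abstract does not spell out how the Bloom filter is extended to return values, so the following is only a minimal sketch of one common construction, not the authors' implementation. It assumes that token spam probabilities are lossily quantized into a small number of levels and that one Bloom filter is kept per level; a lookup returns the level whose filter reports the token. The names `BloomFilter`, `ApproxTokenTable`, `put`, and `get`, and all parameter values, are hypothetical. Scanning from the least spammy level upward is one assumed way to push Bloom-filter collisions toward false negatives rather than false positives.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (hypothetical helper for this sketch)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key)
        )


class ApproxTokenTable:
    """Approximate token -> spam-probability map: one Bloom filter per level."""

    def __init__(self, levels=16, num_bits=1 << 16, num_hashes=4):
        self.levels = levels
        self.filters = [BloomFilter(num_bits, num_hashes) for _ in range(levels)]

    def _quantize(self, prob):
        # Lossy encoding: keep only which of `levels` equal-width bins prob is in.
        return min(int(prob * self.levels), self.levels - 1)

    def put(self, token, spam_prob):
        self.filters[self._quantize(spam_prob)].add(token)

    def get(self, token, default=0.5):
        # Scan from the least spammy level up, so that a Bloom-filter collision
        # can only make a token look less spammy (assumed bias toward false
        # negatives). Unknown tokens get a neutral default.
        for level, bloom in enumerate(self.filters):
            if token in bloom:
                return (level + 0.5) / self.levels  # midpoint of the bin
        return default


table = ApproxTokenTable(levels=16)
table.put("viagra", 0.97)
table.put("meeting", 0.03)
print(table.get("viagra"), table.get("meeting"), table.get("unseen-token"))
```

In this sketch the table stores no keys at all, only bits, which is where the memory savings would come from; the cost is a quantization error of at most half a bin width plus the usual Bloom-filter false-match rate, both of which shrink as `levels` and `num_bits` grow.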