Fast statistical spam filter by approximate classifications

  • Authors: Kang Li; Zhenyu Zhong
  • Affiliations: University of Georgia, Athens, Georgia; University of Georgia, Athens, Georgia
  • Venue: SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
  • Year: 2006

Abstract

Statistical Bayesian filters have become a popular and important defense against spam. Despite their effectiveness, however, their greater processing overhead can prevent them from scaling well to enterprise-level mail servers. For example, the dictionary lookups that are characteristic of this approach are limited by the memory access rate and are therefore relatively insensitive to increases in CPU speed. We address this scaling issue by proposing an acceleration technique that speeds up Bayesian filters through approximate classification. The approximation uses two methods: hash-based lookup and lossy encoding. Lookup approximation is based on the popular Bloom filter data structure, extended to support value retrieval. Lossy encoding is used to further compress the data structure. While both methods introduce additional errors relative to a strict Bayesian approach, we show how the errors can be both minimized and biased toward false negative classifications. We demonstrate a 6x speedup over two well-known spam filters (bogofilter and qsf) while achieving an identical false positive rate and a similar false negative rate to the original filters.
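
The two approximations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ApproxTokenTable`, the one-byte cell layout, the BLAKE2b-based hashing, and the keep-the-smaller-value collision rule are all assumptions chosen to show one plausible way a Bloom-filter-like table could return values and lean toward false negatives.

```python
import hashlib


class ApproxTokenTable:
    """Hypothetical Bloom-filter-style table mapping tokens to spam probabilities."""

    def __init__(self, num_cells=1 << 16, num_hashes=4):
        # One byte per cell; 0 marks an empty cell (assumed layout).
        self.cells = bytearray(num_cells)
        self.num_hashes = num_hashes

    def _positions(self, token):
        # Derive num_hashes cell indexes from salted BLAKE2b digests.
        data = token.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % len(self.cells)

    @staticmethod
    def _encode(prob):
        # Lossy encoding: round a probability in [0, 1] down to a level in 1..255.
        return 1 + int(prob * 254)

    @staticmethod
    def _decode(level):
        return (level - 1) / 254

    def insert(self, token, spam_prob):
        level = self._encode(spam_prob)
        for pos in self._positions(token):
            current = self.cells[pos]
            # On collision keep the smaller (less spammy) value, so errors
            # push toward false negatives rather than false positives.
            if current == 0 or level < current:
                self.cells[pos] = level

    def lookup(self, token):
        levels = [self.cells[pos] for pos in self._positions(token)]
        if 0 in levels:
            return None  # at least one empty cell: token was never inserted
        # Every cell is <= the token's own level, so the largest cell is the
        # closest estimate that still does not exceed the stored value.
        return self._decode(max(levels))
```

In this sketch, a lookup touches a fixed number of byte cells instead of walking a dictionary, and an inserted token's estimate is never higher than the probability it was stored with. A token that was never inserted can still be reported as present when all of its cells happen to be occupied, which is the usual Bloom filter error.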