Clustering for semi-supervised spam filtering

  • Authors:
  • John S. Whissell; Charles L. A. Clarke

  • Affiliations:
  • University of Waterloo, Waterloo, Ontario, Canada; University of Waterloo, Waterloo, Ontario, Canada

  • Venue:
  • Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
  • Year:
  • 2011

Abstract

We present a novel investigation of email clustering, demonstrating that clustering can be a powerful tool for email spam filtering. We first extend the well-known notion that ham and spam emails can be divided into clusters, showing the striking result that almost any reasonable clustering algorithm will naturally partition an email dataset into clusters that are almost entirely ham or almost entirely spam. We then consider a specific semi-supervised spam filtering scenario: a large amount of training data is available, but true labels can be obtained for only a few of its messages. We present two spam filtering approaches for this scenario, both of which start with a clustering of the training email. Our first approach uses the true label of each cluster's medoid to train a spam filter. Our second approach functions similarly to the first, except that the true label of each cluster's medoid is used as the label of every email within that cluster, giving a much larger set of labels for training while still requiring only a few true labels. We evaluate our approaches using the TREC2005 and CEAS2008 spam email datasets. For a wide range of numbers of true labels, we show that both of our approaches significantly outperform training on the same number of randomly selected email messages. The results of our second approach are also better than those of a previously published state-of-the-art semi-supervised small-sample spam filtering approach.
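
The two labeling schemes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the feature representation, or the distance measure, so the sketch assumes the clusters are already given, represents each email as a small numeric vector, and uses squared Euclidean distance. The names `medoid`, `medoid_training_sets`, and `ask_label` are hypothetical, and training the spam filter itself is left out.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    # Assumed distance measure; the abstract does not name one.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def medoid(cluster: List[int], points: Sequence[Vector]) -> int:
    """Return the index of the member with the smallest total distance
    to all members of its cluster."""
    return min(
        cluster,
        key=lambda i: sum(squared_distance(points[i], points[j]) for j in cluster),
    )


def medoid_training_sets(
    clusters: List[List[int]],
    points: Sequence[Vector],
    ask_label: Callable[[int], str],
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Build the training sets for both approaches.

    ask_label stands in for obtaining one true label (e.g. from a user);
    it is called exactly once per cluster.
    """
    medoid_only = []   # first approach: train on labeled medoids only
    propagated = []    # second approach: medoid label copied to the whole cluster
    for cluster in clusters:
        m = medoid(cluster, points)
        label = ask_label(m)
        medoid_only.append((m, label))
        propagated.extend((i, label) for i in cluster)
    return medoid_only, propagated


# Toy data (made up for illustration): two well-separated groups of "emails".
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
clusters = [[0, 1, 2], [3, 4, 5]]
true_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
queried = []  # records which emails needed a true label


def ask_label(i: int) -> str:
    queried.append(i)
    return true_labels[i]


medoid_only, propagated = medoid_training_sets(clusters, points, ask_label)
```

In this toy run only two true labels are requested (one per cluster), yet the second training set covers all six messages. Propagated labels are only as accurate as the clusters are pure, which is why the near-pure ham/spam clustering result matters for the second approach.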