Clustering for semi-supervised spam filtering

Authors:
John S. Whissell; Charles L. A. Clarke
Affiliations:
University of Waterloo, Waterloo, Ontario, Canada; University of Waterloo, Waterloo, Ontario, Canada
Venue:
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
Year:
2011

Abstract

We present a novel investigation of email clustering, demonstrating that clustering can be a powerful tool for email spam filtering. We first extend the well-known notion that ham and spam emails can be divided into clusters, showing the striking result that almost any reasonable clustering algorithm will naturally partition an email dataset into clusters that are almost entirely ham or almost entirely spam. We then consider the specific semi-supervised spam filtering scenario in which a large amount of training data is available, but true labels can be obtained for only a few of its messages. We present two spam filtering approaches for this scenario, both of which start with a clustering of the training email. Our first approach uses the true label of each cluster's medoid to train a spam filter. Our second approach functions similarly to the first, except that the true label of each cluster's medoid is used as the label of every email within the cluster, giving a much larger set of labels for training while still requiring only a few true labels. We evaluate our approaches using the TREC2005 and CEAS2008 spam email datasets. Across a wide range of true-label counts, we show that both of our approaches significantly outperform training on the same number of randomly selected email messages. The results of our second approach are also better than those of a previously published state-of-the-art semi-supervised small-sample spam filtering approach.
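
The two labeling strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a clustering has already been computed by some algorithm, it uses Euclidean distance to pick medoids (the abstract does not state the distance measure), and the function names `find_medoids`, `medoid_training_set`, `propagated_training_set`, and the `ask_label` callback are hypothetical. Training the downstream spam filter is left out.

```python
from collections import defaultdict
from math import dist


def find_medoids(vectors, assignments):
    """Map each cluster id to the index of its medoid, i.e. the member
    with the smallest total distance to all members of that cluster."""
    clusters = defaultdict(list)
    for idx, cluster_id in enumerate(assignments):
        clusters[cluster_id].append(idx)
    medoids = {}
    for cluster_id, members in clusters.items():
        medoids[cluster_id] = min(
            members,
            key=lambda i: sum(dist(vectors[i], vectors[j]) for j in members),
        )
    return medoids


def medoid_training_set(medoids, ask_label):
    """First approach: train only on the medoids (one true label per cluster)."""
    return {idx: ask_label(idx) for idx in medoids.values()}


def propagated_training_set(assignments, medoids, ask_label):
    """Second approach: each message inherits the true label of its cluster's medoid."""
    cluster_label = {cid: ask_label(idx) for cid, idx in medoids.items()}
    return {idx: cluster_label[cid] for idx, cid in enumerate(assignments)}


# Toy data (assumed): 2-D points standing in for email feature vectors.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
assignments = [0, 0, 0, 1, 1, 1]  # cluster id per message, from any clustering algorithm
true_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]  # only medoids get queried

medoids = find_medoids(vectors, assignments)
small = medoid_training_set(medoids, lambda i: true_labels[i])
large = propagated_training_set(assignments, medoids, lambda i: true_labels[i])
```

In this toy case both strategies request two true labels, but `small` yields two training examples while `large` yields six. If a cluster were not pure, the second strategy would mislabel its minority messages, which is why the near-purity of clusters reported in the abstract matters for that approach.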