We present a novel investigation of email clustering, demonstrating that clustering can be a powerful tool for email spam filtering. We first extend the well-known notion that ham and spam emails can be divided into clusters, showing the striking result that almost any reasonable clustering algorithm will naturally partition an email dataset into clusters that are almost entirely ham or almost entirely spam. We then consider a specific semi-supervised spam filtering scenario: a large amount of training data is available, but true labels can be obtained for only a few of the messages. We present two spam filtering approaches for this scenario, both of which start with a clustering of the training email. Our first approach trains a spam filter on the true label of each cluster's medoid. Our second approach works similarly, except that the true label of each cluster's medoid is used as the label of every email within the cluster, giving a much larger set of labels for training while still requiring only a few true labels. We evaluate our approaches using the TREC2005 and CEAS2008 spam email datasets. Across a large range of numbers of true labels, we show that both of our approaches significantly outperform training on the same number of randomly selected email messages. The results of our second approach are also better than those of a previously published state-of-the-art semi-supervised small-sample spam filtering approach.
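The two labeling strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not fix a clustering algorithm, a distance measure, or a classifier, so the Jaccard distance on token sets, the precomputed cluster assignments, the toy emails, and all function names below are assumptions made only for illustration. The sketch shows how each approach builds its training labels from one true label per cluster.

```python
from collections import defaultdict


def jaccard_distance(a, b):
    """Distance between two token sets (assumed measure, not from the paper)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def find_medoids(docs, assignments):
    """Map each cluster id to the member with the smallest total distance
    to the other members of the same cluster."""
    members = defaultdict(list)
    for idx, cluster_id in enumerate(assignments):
        members[cluster_id].append(idx)
    medoids = {}
    for cluster_id, idxs in members.items():
        medoids[cluster_id] = min(
            idxs,
            key=lambda i: sum(jaccard_distance(docs[i], docs[j]) for j in idxs),
        )
    return medoids


def medoid_training_set(medoids, oracle):
    """First approach: train only on the medoids, one true label per cluster."""
    return {idx: oracle(idx) for idx in medoids.values()}


def propagated_training_set(assignments, medoids, oracle):
    """Second approach: every email inherits the true label of its cluster's medoid."""
    cluster_label = {cid: oracle(idx) for cid, idx in medoids.items()}
    return {idx: cluster_label[cid] for idx, cid in enumerate(assignments)}


# Hypothetical toy data: emails as token sets, already clustered by some algorithm.
docs = [
    {"cheap", "pills", "buy"},
    {"cheap", "pills", "now"},
    {"cheap", "pills", "buy", "now"},
    {"meeting", "agenda", "monday"},
    {"meeting", "agenda", "notes"},
    {"meeting", "agenda", "monday", "notes"},
]
assignments = [0, 0, 0, 1, 1, 1]
true_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

label_requests = []


def oracle(idx):
    """Stands in for a human labeler; records which emails were queried."""
    label_requests.append(idx)
    return true_labels[idx]


medoids = find_medoids(docs, assignments)
few_labels = medoid_training_set(medoids, oracle)
many_labels = propagated_training_set(assignments, medoids, oracle)

print(medoids)
print(few_labels)
print(many_labels)
```

In both cases the oracle is asked only about the medoids, so the labeling cost equals the number of clusters. The second approach yields a label for every training email, which is correct only to the extent that the clusters are pure; either label set would then be passed to whatever spam filter is being trained.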