Supervised clustering of streaming data for email batch detection

Authors:
Peter Haider;Ulf Brefeld;Tobias Scheffer
Affiliations:
Max Planck Institute for Computer Science, Saarbrücken, Germany;Max Planck Institute for Computer Science, Saarbrücken, Germany;Max Planck Institute for Computer Science, Saarbrücken, Germany
Venue:
Proceedings of the 24th international conference on Machine learning
Year:
2007

Abstract

We address the problem of detecting batches of emails that have been created according to the same template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information about entire batches of jointly generated messages. The application matches the problem setting of supervised clustering, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering are cubic in the number of instances. When decisions cannot be reconsidered once they have been made, owing to the streaming nature of the data, the decoding problem can be solved in linear time. We devise a sequential decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact of collective attributes of email batches on the effectiveness of recognizing spam emails.
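
The sequential decoding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `sequential_clustering` and `jaccard_minus_half`, the zero decision threshold, and the toy messages are assumptions made for illustration, and a fixed token-overlap score stands in for the learned similarity the abstract refers to. Each arriving message is scored against every existing cluster and assigned exactly once, so a single decision evaluates the similarity once per previously seen message.

```python
from typing import Callable, List, Sequence


def jaccard_minus_half(a: str, b: str) -> float:
    """Hypothetical signed similarity: token Jaccard overlap shifted by 0.5.

    Positive values suggest "same template", negative values "different".
    A learned similarity would take the place of this fixed function.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.5
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b) - 0.5


def sequential_clustering(
    stream: Sequence[str],
    sim: Callable[[str, str], float] = jaccard_minus_half,
) -> List[List[int]]:
    """Assign each message to a cluster as it arrives; never revise a decision."""
    items: List[str] = []          # messages seen so far
    clusters: List[List[int]] = []  # each cluster is a list of message indices
    for message in stream:
        # Score of joining a cluster: summed similarity to its members.
        scores = [sum(sim(message, items[j]) for j in members)
                  for members in clusters]
        index = len(items)
        items.append(message)
        if scores and max(scores) > 0:
            clusters[scores.index(max(scores))].append(index)
        else:
            # No cluster has a positive score: open a new one.
            clusters.append([index])
    return clusters


emails = [
    "cheap pills buy now alice",
    "cheap pills buy now bob",
    "meeting agenda for monday",
    "cheap pills buy now carol",
]
print(sequential_clustering(emails))  # -> [[0, 1, 3], [2]]
```

In this toy run the three templated messages end up in one cluster and the unrelated message opens its own. Because earlier assignments are never reconsidered, the cost of each decision grows with the number of messages seen so far rather than with all possible re-clusterings.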