Combining email models for false positive reduction

Authors:
Shlomo Hershkop;Salvatore J. Stolfo
Affiliations:
Columbia University, New York, NY;Columbia University, New York, NY
Venue:
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Year:
2005



Abstract

Machine learning and data mining can be used effectively to model, classify, and discover interesting information in a wide variety of data, including email. The Email Mining Toolkit (EMT) has been designed to provide a wide range of analyses for arbitrary email sources. Depending on the task, one can usually achieve very high accuracy, but at the cost of some false positives. In practice, false positives are often prohibitively expensive. In spam detection, for example, even a single misclassified email may be unacceptable if it is an important one. Much work has been done to improve specific algorithms for detecting unwanted messages, but less work has been reported on leveraging multiple algorithms and correlating models in this particular domain of email analysis. EMT has been updated with new correlation functions that allow the analyst to integrate a number of the user behavior models available in EMT's core technology. We present results of combining classifier outputs both to improve accuracy and to reduce false positives for the problem of spam detection. We apply these methods to a very large email data set and show the results of different combination methods on this corpus. We introduce a new method to compare multiple and combined classifiers, and show how it differs from past work. The method analyzes the relative gain and the maximum possible accuracy that can be achieved for certain combinations of classifiers in order to automatically choose the best combination.
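
The abstract does not spell out how classifier outputs are combined or how relative gain and maximum possible accuracy are computed, so the following is only a minimal sketch under stated assumptions, not the paper's method or EMT's API. It assumes binary labels (1 = spam, 0 = legitimate), uses plain majority voting as a stand-in combiner, takes the "maximum possible accuracy" to be the fraction of messages that at least one classifier labels correctly, and defines relative gain as the share of the headroom above the best single classifier that the combination recovers. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of messages whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def false_positive_rate(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of legitimate messages (label 0) that were flagged as spam."""
    legit_preds = [p for p, y in zip(preds, labels) if y == 0]
    return sum(legit_preds) / len(legit_preds) if legit_preds else 0.0


def majority_vote(columns: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier predictions; spam only on a strict majority."""
    n = len(columns)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*columns)]


def oracle_accuracy(columns: Sequence[Sequence[int]], labels: Sequence[int]) -> float:
    """Assumed ceiling: a message counts if any single classifier got it right."""
    hits = sum(any(p == y for p in votes) for votes, y in zip(zip(*columns), labels))
    return hits / len(labels)


def relative_gain(combined: float, best_single: float, ceiling: float) -> float:
    """Share of the headroom (ceiling - best_single) captured by the combination."""
    headroom = ceiling - best_single
    return (combined - best_single) / headroom if headroom > 0 else 0.0


# Toy data (hypothetical): true labels and three classifiers' predictions.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
clf_a = [1, 1, 0, 0, 0, 1, 1, 0]
clf_b = [1, 0, 1, 0, 1, 0, 1, 0]
clf_c = [0, 1, 1, 0, 0, 1, 0, 0]
classifiers = [clf_a, clf_b, clf_c]

combined = majority_vote(classifiers)
best_single = max(accuracy(c, labels) for c in classifiers)
ceiling = oracle_accuracy(classifiers, labels)
gain = relative_gain(accuracy(combined, labels), best_single, ceiling)

print("combined accuracy:", accuracy(combined, labels))
print("combined false positive rate:", false_positive_rate(combined, labels))
print("oracle ceiling:", ceiling)
print("relative gain:", gain)
```

A selection procedure in this spirit would compute the same quantities for each candidate subset of classifiers and keep the subset with the best trade-off between gain and false positive rate; how the paper actually ranks combinations is not stated in the abstract.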