Enhanced email spam filtering through combining similarity graphs

Authors:
Anirban Dasgupta;Maxim Gurevich;Kunal Punera
Affiliations:
Yahoo! Research, Sunnyvale, CA, USA;Yahoo! Research, Sunnyvale, CA, USA;Yahoo! Research, Sunnyvale, CA, USA
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Citing 20

A Framework for Collaborative, Content-Based and Demographic Filtering. Artificial Intelligence Review - Special issue on data mining on the Internet.
A joint framework for collaborative and content filtering. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval.
Adversarial classification. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.
A support vector method for multivariate performance measures. ICML '05 Proceedings of the 22nd international conference on Machine learning.
Why phishing works. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Spam and the ongoing battle for the inbox. Communications of the ACM - Spam and the ongoing battle for the inbox.
Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research.
Relaxed online SVMs for spam filtering. SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval.
An evaluation of Naive Bayes variants in content-based learning for spam filtering. Intelligent Data Analysis.
A theory of learning with similarity functions. Machine Learning.
Web spam identification through content and hyperlinks. AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web.
Spamalytics: an empirical analysis of spam marketing conversion. Proceedings of the 15th ACM conference on Computer and communications security.
Feature hashing for large scale multitask learning. ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning.
Regression-based latent factor models. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.
On the relative age of spam and ham training samples for email filtering. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval.
Matrix Factorization Techniques for Recommender Systems. Computer.
Pairwise preference regression for cold-start recommendation. Proceedings of the third ACM conference on Recommender systems.
New filtering approaches for phishing email. Journal of Computer Security - EU-Funded ICT Research on Trust and Security.
Spamcraft: an inside look at spam campaign orchestration. LEET'09 Proceedings of the 2nd USENIX conference on Large-scale exploits and emergent threats: botnets, spyware, worms, and more.
Re: CAPTCHAs: understanding CAPTCHA-solving services in an economic context. USENIX Security'10 Proceedings of the 19th USENIX conference on Security.

Cited 3

Impact of spam exposure on user engagement. Security'12 Proceedings of the 21st USENIX conference on Security symposium.
Grindstone4Spam: An optimization toolkit for boosting e-mail classification. Journal of Systems and Software.
Personalized email recommender system based on user actions. SEAL'12 Proceedings of the 9th international conference on Simulated Evolution and Learning.

Abstract

Over the last decade, email spam has evolved from being just an irritant to users to being truly dangerous. This has led web-mail providers and academic researchers to dedicate considerable resources towards tackling this problem [9, 21, 22, 24, 26]. However, we argue that some aspects of the spam filtering problem are not handled appropriately in existing work. Principal among these are adversarial spammer efforts (spammers routinely tune their spam emails to bypass spam filters, and contaminate ground truth via fake HAM/SPAM votes) and the scale and sparsity of the problem, which essentially preclude learning with a very large set of parameters. In this paper we propose an approach that learns to filter spam by striking a balance between generalizing HAM/SPAM votes across users and emails (to alleviate sparsity) and learning local models for each user (to limit the effect of adversarial votes); votes are shared only amongst users and emails that are "similar" to one another. Moreover, we define user-user and email-email similarities using spam-resilient features that are extremely difficult for spammers to fake. We give a methodology that learns to combine multiple features into similarity values while directly optimizing the objective of better spam filtering. A useful side effect of this methodology is that the number of parameters that need to be estimated is very small: this helps us use off-the-shelf learning algorithms to achieve good accuracy while preventing over-training to the adversarial noise in the data. Finally, our approach gives a systematic way to incorporate existing spam-fighting technologies, such as IP blacklists and keyword-based classifiers, into one framework. Experiments on a real-world email dataset show that our approach leads to significant improvements compared to two state-of-the-art baselines.
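
The idea of sharing votes only among "similar" users and emails can be illustrated with a small sketch. This is not the paper's algorithm: the function names (combine_graphs, similarity, spam_score), the linear weighting of feature graphs, and the similarity-weighted vote average are all assumptions made for illustration. In particular, the weights are fixed by hand here, whereas the abstract describes learning the combination while optimizing filtering accuracy.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Graph = Dict[Pair, float]          # sparse similarity graph: (a, b) -> similarity
Vote = Tuple[str, str, int]        # (user, email, label); +1 = SPAM, -1 = HAM


def combine_graphs(graphs: List[Graph], weights: List[float]) -> Graph:
    """Combine per-feature similarity graphs into one graph.

    Assumption: a plain weighted sum with hand-picked weights. Only one
    weight per feature graph is needed, so the parameter count stays small.
    """
    combined: Graph = {}
    for weight, graph in zip(weights, graphs):
        for pair, value in graph.items():
            combined[pair] = combined.get(pair, 0.0) + weight * value
    return combined


def similarity(graph: Graph, a: str, b: str) -> float:
    """Symmetric lookup; an item is fully similar to itself, unknown pairs score 0."""
    if a == b:
        return 1.0
    return graph.get((a, b), graph.get((b, a), 0.0))


def spam_score(user: str, email: str, votes: List[Vote],
               user_sim: Graph, email_sim: Graph) -> float:
    """Similarity-weighted average of votes, in [-1, 1].

    A vote counts only if both its voter is similar to `user` and its
    email is similar to `email`, so votes from dissimilar (possibly
    adversarial) accounts carry little or no weight.
    """
    numerator = 0.0
    denominator = 0.0
    for voter, voted_email, label in votes:
        weight = similarity(user_sim, user, voter) * similarity(email_sim, email, voted_email)
        numerator += weight * label
        denominator += weight
    return numerator / denominator if denominator > 0 else 0.0
```

With hypothetical data in which "alice" is strongly similar to "bob" and only weakly similar to "mallory", a SPAM vote from "bob" outweighs a HAM vote from "mallory" on a similar email, and the score for "alice" comes out positive (leaning SPAM). A user with no similar voters gets a neutral score of 0.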