Spam detection using web page content: a new battleground

Authors:
Marco Túlio Ribeiro;Pedro H. Calais Guerra;Leonardo Vilela;Adriano Veloso;Dorgival Guedes;Wagner Meira, Jr.;Marcelo H. P. C. Chaves;Klaus Steding-Jessen;Cristine Hoepers
Affiliations:
Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Brazilian Network Information Center (NIC.br), Sao Paulo, Brazil;Brazilian Network Information Center (NIC.br), Sao Paulo, Brazil;Brazilian Network Information Center (NIC.br), Sao Paulo, Brazil
Venue:
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
Year:
2011

Abstract

Traditional content-based e-mail spam filtering takes into account the content of e-mail messages and applies machine learning techniques to infer patterns that discriminate spam from ham. In particular, the use of content-based spam filtering unleashed an unending arms race between spammers and filter developers, given the spammers' ability to continuously change spam message content in ways that might circumvent the current filters. In this paper, we propose to expand the horizons of content-based filters by taking into consideration the content of the Web pages linked by e-mail messages. We describe a methodology for extracting pages linked by URLs in spam messages, and we characterize the relationship between those pages and the messages. We then use a machine learning technique (a lazy associative classifier) to extract classification rules from the Web pages that are relevant to spam detection. We demonstrate that the use of information from linked pages can complement current spam classification techniques, as represented by SpamAssassin. Our study shows that the pages linked by spam messages are a very promising battleground.
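
The abstract does not specify how linked pages are located, so the following is only a minimal sketch of the first step, collecting the URLs that appear in a message, and not the authors' implementation. The function name `extract_urls` and the regular expression `URL_PATTERN` are assumptions introduced for illustration; only Python standard-library calls are used.

```python
import re
from email import message_from_string
from typing import List

# Hypothetical pattern (an assumption, not from the paper): an http/https URL
# running up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_urls(raw_message: str) -> List[str]:
    """Return distinct URLs from the text parts of an e-mail, in first-seen order."""
    message = message_from_string(raw_message)
    urls: List[str] = []
    for part in message.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        for url in URL_PATTERN.findall(text):
            # Drop sentence punctuation that the pattern may have captured.
            url = url.rstrip(".,;:!?")
            if url and url not in urls:
                urls.append(url)
    return urls


sample = (
    "From: sender@example.com\n"
    "Subject: offer\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Buy now at http://example.com/offer. "
    "Also see https://example.org/page?id=1, "
    "or http://example.com/offer again.\n"
)
print(extract_urls(sample))
```

Downloading the pages behind these URLs and feeding their content to a classifier is deliberately left out, since the abstract gives no detail on either step.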