Spam detection using web page content: a new battleground

  • Authors:
  • Marco Túlio Ribeiro;Pedro H. Calais Guerra;Leonardo Vilela;Adriano Veloso;Dorgival Guedes;Wagner Meira, Jr.;Marcelo H. P. C. Chaves;Klaus Steding-Jessen;Cristine Hoepers

  • Affiliations:
  • Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil;Brazilian Network Information Center (NIC.br), Sao Paulo, Brazil;Brazilian Network Information Center (NIC.br), Sao Paulo, Brazil;Brazilian Network Information Center (NIC.br), Sao Paulo, Brazil

  • Venue:
  • Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
  • Year:
  • 2011

Abstract

Traditional content-based e-mail spam filtering takes into account the content of e-mail messages and applies machine learning techniques to infer patterns that discriminate spam from ham. In particular, the use of content-based spam filtering unleashed an unending arms race between spammers and filter developers, given the spammers' ability to continuously change spam message content in ways that might circumvent the current filters. In this paper, we propose to expand the horizons of content-based filters by taking into consideration the content of the Web pages linked by e-mail messages. We describe a methodology for extracting pages linked by URLs in spam messages, and we characterize the relationship between those pages and the messages. We then use a machine learning technique (a lazy associative classifier) to extract classification rules from the Web pages that are relevant to spam detection. We demonstrate that information from linked pages can complement current spam classification techniques, as represented by SpamAssassin. Our study shows that the pages linked by spam messages are a very promising battleground.
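
As an illustration of the first step the abstract mentions (finding the URLs in a message so that the linked pages can be retrieved), the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the abstract does not describe their extraction procedure, and the function name `extract_urls` and the regular expression `URL_PATTERN` are assumptions made here for illustration. Fetching the pages and classifying them are deliberately left out.

```python
import re
from email import message_from_string

# Hypothetical pattern (an assumption, not from the paper): an http/https URL
# running up to whitespace or a common delimiter character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def extract_urls(raw_message: str) -> list[str]:
    """Return the distinct URLs found in the text parts of an e-mail, in order."""
    msg = message_from_string(raw_message)
    urls: list[str] = []
    for part in msg.walk():
        # Only text/* parts (plain text or HTML) are scanned for links.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            if url not in urls:
                urls.append(url)
    return urls


raw = (
    "From: a@example.com\n"
    "Subject: hi\n"
    "\n"
    "Buy now at http://example.com/offer. "
    "Also see https://example.org/page?x=1, "
    "and http://example.com/offer again.\n"
)
found = extract_urls(raw)
```

A regular expression is a deliberately simple choice here; real spam often obfuscates links, so a production extractor would likely also parse HTML anchors and follow redirects before handing page content to a classifier.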