Spam filtering using Kolmogorov complexity analysis

  • Authors:
  • G. Richard; A. Doncescu

  • Affiliations:
  • IRIT, University of Toulouse, France; LAAS CNRS, University of Toulouse, France

  • Venue:
  • International Journal of Web and Grid Services
  • Year:
  • 2008

Abstract

One of the most irrelevant side effects of e-commerce technology is the development of spamming as an e-marketing technique. Spam e-mails (or unsolicited commercial e-mails) induce a burden for everybody having an electronic mailbox: detecting and filtering spam is then a challenging task and a lot of approaches have been developed to identify spam before it is posted in the end user's mailbox. In this paper, we focus on a relatively new approach whose foundations rely on the works of A. Kolmogorov. The main idea is to give a formal meaning to the notion of 'information content' and to provide a measure of this content. Using such a quantitative approach, it becomes possible to define a distance, which is a major tool for classification purposes. To validate our approach, we proceed in two steps: first, we use the classical compression distance over a mix of spam and legitimate e-mails to check out if they can be properly clustered without any supervision. It has been the case to highlight a kind of underlying structure for spam e-mails. In the second step, we have implemented a k-nearest neighbours algorithm providing 85% as accuracy rate. Coupled with other anti-spam techniques, compression-based methods could bring a great help in the spam filtering challenge.
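
The two steps described in the abstract can be illustrated with a minimal sketch. It assumes that the "classical compression distance" is the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The abstract does not name the compressor, the value of k, or the data set, so the use of zlib, the default k = 3, and the function names below are all assumptions made for illustration; this is not the authors' implementation.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Approximate the information content of data by its zlib-compressed length.

    Assumption: zlib stands in for whichever compressor the paper uses.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(message: bytes, labelled: list, k: int = 3) -> str:
    """Label a message by majority vote among its k nearest labelled e-mails.

    labelled is a list of (body, label) pairs; k = 3 is an arbitrary choice.
    """
    neighbours = sorted(labelled, key=lambda item: ncd(message, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

A call such as `knn_classify(body, training_set, k=3)` returns the majority label among the three training e-mails closest to `body`. Because the distance needs no feature extraction, the same `ncd` function could also feed an unsupervised clustering step like the one mentioned in the abstract.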