Spam filtering using Kolmogorov complexity analysis

Authors:
G. Richard; A. Doncescu
Affiliations:
IRIT, University of Toulouse, France; LAAS CNRS, University of Toulouse, France
Venue:
International Journal of Web and Grid Services
Year:
2008

Abstract

One of the most undesirable side effects of e-commerce technology is the development of spamming as an e-marketing technique. Spam e-mails (or unsolicited commercial e-mails) are a burden for everybody who has an electronic mailbox: detecting and filtering spam is therefore a challenging task, and many approaches have been developed to identify spam before it reaches the end user's mailbox. In this paper, we focus on a relatively new approach whose foundations rely on the work of A. Kolmogorov. The main idea is to give a formal meaning to the notion of 'information content' and to provide a measure of this content. Using such a quantitative approach, it becomes possible to define a distance, which is a major tool for classification purposes. To validate our approach, we proceed in two steps. First, we use the classical compression distance over a mix of spam and legitimate e-mails to check whether they can be properly clustered without any supervision. This proved to be the case, which highlights a kind of underlying structure in spam e-mails. In the second step, we implemented a k-nearest neighbours algorithm that achieves an accuracy rate of 85%. Coupled with other anti-spam techniques, compression-based methods could be of great help in the spam filtering challenge.
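
To make the idea of a compression-based distance more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the distance is the normalized compression distance as it is commonly defined, that zlib stands in for the compressor, and that k-nearest neighbours uses a simple majority vote; the abstract specifies none of these choices. The function names, the default value of k and the sample messages are all hypothetical.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a computable stand-in for
    Kolmogorov complexity (assumption: the paper may use another compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(message: bytes, labelled: list[tuple[bytes, str]], k: int = 3) -> str:
    """Label a message by majority vote among its k nearest labelled messages.
    The value of k is a hypothetical default, not taken from the paper."""
    neighbours = sorted(labelled, key=lambda item: ncd(message, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
training = [
    (b"buy cheap pills now " * 20, "spam"),
    (bytes(range(256)), "ham"),
]
label = knn_classify(b"buy cheap pills now " * 15, training, k=1)
```

A real compressor only approximates the underlying complexity, so the distance between a message and itself is small but not exactly zero, and results on very short messages can be noisy.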