One of the most unwelcome side effects of e-commerce technology is the development of spamming as an e-marketing technique. Spam e-mails (or unsolicited commercial e-mails) are a burden for everybody with an electronic mailbox. Detecting and filtering spam is therefore a challenging task, and many approaches have been developed to identify spam before it reaches the end user's mailbox. In this paper, we focus on a relatively new approach whose foundations rely on the work of A. Kolmogorov. The main idea is to give a formal meaning to the notion of 'information content' and to provide a measure of this content. Using such a quantitative approach, it becomes possible to define a distance, which is a major tool for classification purposes. To validate our approach, we proceed in two steps. First, we use the classical compression distance over a mix of spam and legitimate e-mails to check whether they can be properly clustered without any supervision. This turned out to be the case, which highlights a kind of underlying structure in spam e-mails. Second, we implemented a k-nearest neighbours algorithm, which achieves an accuracy rate of 85%. Coupled with other anti-spam techniques, compression-based methods could be of great help in the spam filtering challenge.
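The two steps described above can be illustrated with a minimal sketch. It assumes that the "classical compression distance" is the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The abstract does not name the compressor, so zlib is used here purely as a stand-in, and the function names `compressed_size`, `ncd` and `knn_classify` are hypothetical, not the authors' implementation.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Approximate the information content of data by its zlib-compressed length."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (assumed form of the 'compression distance')."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(message: bytes, labelled: list[tuple[bytes, str]], k: int = 3) -> str:
    """Label a message by majority vote among its k nearest labelled e-mails under NCD."""
    nearest = sorted(labelled, key=lambda item: ncd(message, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A call such as `knn_classify(new_mail, training_set, k=3)` takes a list of `(raw_bytes, label)` pairs as the training set. A real compressor only approximates Kolmogorov complexity, so the distance is a practical estimate; for very short messages the compressor's fixed overhead can dominate the result.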