Spam detection using character n-grams

  • Authors:
  • Ioannis Kanaris; Konstantinos Kanaris; Efstathios Stamatatos

  • Affiliations:
  • Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Greece; Dept. of Mathematics, University of the Aegean, Karlovassi, Greece; Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Greece

  • Venue:
  • SETN'06 Proceedings of the 4th Hellenic conference on Advances in Artificial Intelligence
  • Year:
  • 2006

Abstract

This paper presents a content-based approach to spam detection that relies on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse data problem that arises with word-level n-grams. Moreover, it is language-independent and does not require a lemmatizer or any 'deep' text preprocessing. We evaluate the proposed representation in combination with support vector machines through experiments on the Ling-Spam corpus. Both the binary and the term-frequency representations achieve high precision while maintaining recall at an equally high level, which is a crucial factor for anti-spam filters, a cost-sensitive application.
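
To make the 'bag of character n-grams' idea concrete, the following is a minimal sketch of how the binary and term-frequency representations mentioned in the abstract could be built for a single message. It is an illustration only, not the authors' implementation: the function names, the choice of n = 3, and the normalization of counts by the total number of n-grams are all assumptions made here. The resulting feature dictionaries would then be passed to a classifier such as a support vector machine, which is not shown.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every contiguous substring of length n (a character n-gram).

    n=3 is an illustrative default, not a value taken from the paper.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def term_frequency_features(text, n=3):
    """Map each n-gram to its relative frequency in the text.

    Dividing by the total n-gram count is an assumed normalization.
    """
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def binary_features(text, n=3):
    """Map each n-gram that occurs at least once to 1."""
    return {gram: 1 for gram in set(char_ngrams(text, n))}


# Hypothetical example message; spaces are kept as ordinary characters,
# so no tokenizer or lemmatizer is needed.
message = "buy buy"
tf = term_frequency_features(message)
binary = binary_features(message)
```

Because the n-grams are taken directly from the raw character sequence, the same code applies unchanged to text in any language, which is the property the abstract highlights.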