Time-efficient spam e-mail filtering using n-gram models

  • Authors:
  • Ali Çıltık;Tunga Güngör

  • Affiliations:
  • Department of Computer Engineering, Boğaziçi University, İstanbul 34342, Turkey;Department of Computer Engineering, Boğaziçi University, İstanbul 34342, Turkey

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2008

Abstract

In this paper, we propose spam e-mail filtering methods with high accuracy and low time complexity. The methods are based on the n-gram approach and a heuristic referred to as the first n-words heuristic. We develop two models, a class-general model and an e-mail-specific model, and test the methods under these models. The models are then combined so that the latter is activated in cases where the former falls short. Although the proposed approach and the methods developed are general and can be applied to any language, we mainly apply them to Turkish, which is an agglutinative language, and examine some properties of the language. Extensive tests were performed, and success rates of about 98% for Turkish and 99% for English were obtained. The results show that time complexity can be reduced significantly without sacrificing performance.
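The abstract does not give the details of the models, so the following is only a minimal sketch of how an n-gram filter restricted to the first n words of a message might look. It assumes a bigram model per class with add-one smoothing and whitespace tokenization. All names (`first_n_words`, `BigramClassModel`, `classify`) and the default cutoff of 50 words are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def first_n_words(text, n):
    """Keep only the first n whitespace-separated tokens, lowercased."""
    return text.lower().split()[:n]


def bigrams(tokens):
    """Pair each token with its predecessor; '<s>' marks the message start."""
    padded = ["<s>"] + tokens
    return list(zip(padded, padded[1:]))


class BigramClassModel:
    """Bigram counts for one class (assumed design, add-one smoothing)."""

    def __init__(self):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def train(self, messages, n_words):
        for text in messages:
            tokens = first_n_words(text, n_words)
            self.vocab.update(tokens)
            for prev, cur in bigrams(tokens):
                self.bigram_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, text, n_words):
        """Smoothed log-probability of the first n_words of text."""
        tokens = first_n_words(text, n_words)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        total = 0.0
        for prev, cur in bigrams(tokens):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, spam_model, legit_model, n_words=50):
    """Label text by whichever class model scores its first n_words higher."""
    spam_score = spam_model.log_prob(text, n_words)
    legit_score = legit_model.log_prob(text, n_words)
    return "spam" if spam_score > legit_score else "legitimate"
```

Because only the first `n_words` tokens are scanned, the scoring cost per message is bounded by the cutoff rather than by the message length, which is one plausible reading of how such a heuristic lowers running time. For an agglutinative language such as Turkish, the tokens would presumably be reduced to roots by a morphological analysis step first; that step is omitted here.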