Time-efficient spam e-mail filtering using n-gram models

Authors:
Ali Çıltık; Tunga Güngör
Affiliations:
Department of Computer Engineering, Boğaziçi University, İstanbul 34342, Turkey; Department of Computer Engineering, Boğaziçi University, İstanbul 34342, Turkey
Venue:
Pattern Recognition Letters
Year:
2008

Abstract

In this paper, we propose spam e-mail filtering methods with high accuracy and low time complexity. The methods are based on the n-gram approach and a heuristic referred to as the first n-words heuristic. We develop two models, a class general model and an e-mail specific model, and test the methods under these models. The models are then combined so that the e-mail specific model is activated in cases where the class general model falls short. Although the proposed approach and the methods developed are general and can be applied to any language, we mainly apply them to Turkish, which is an agglutinative language, and examine some properties of the language. Extensive tests were performed, and success rates of about 98% for Turkish and 99% for English were obtained. The results show that time complexity can be reduced significantly without sacrificing performance.
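
The abstract does not give the formulas, so the following is only a rough sketch of how an n-gram class model, a first n-words cutoff, and a fallback to a second model could fit together. It is not the authors' implementation. The choice of bigrams, the add-one smoothing, the `first_n` and `margin` parameters, and the `fallback` callable (standing in for an e-mail specific model) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(words):
    """Return adjacent word pairs, with a start marker before the first word."""
    padded = ["<s>"] + words
    return list(zip(padded, padded[1:]))


class BigramClassModel:
    """Bigram counts pooled over all training messages of one class (assumed design)."""

    def __init__(self, messages):
        self.pair_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {"<s>"}
        for text in messages:
            words = text.lower().split()
            self.vocab.update(words)
            for prev, cur in bigrams(words):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, words, vocab_size):
        """Add-one-smoothed log probability of a word sequence (smoothing is an assumption)."""
        total = 0.0
        for prev, cur in bigrams(words):
            numerator = self.pair_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, spam_model, legit_model, first_n=50, margin=1.0, fallback=None):
    """Score only the first `first_n` words; defer to `fallback` when the margin is small."""
    words = text.lower().split()[:first_n]  # first n-words cutoff bounds the work per message
    vocab_size = len(spam_model.vocab | legit_model.vocab)
    diff = spam_model.log_prob(words, vocab_size) - legit_model.log_prob(words, vocab_size)
    if abs(diff) < margin and fallback is not None:
        # Hypothetical hook: a second (e.g. e-mail specific) model decides unclear cases.
        return fallback(words)
    return "spam" if diff > 0 else "legitimate"


# Toy training data, invented purely for this example.
spam_model = BigramClassModel(["buy cheap pills now", "cheap pills buy now online"])
legit_model = BigramClassModel(["meeting at noon tomorrow", "see you at the meeting"])

print(classify("buy cheap pills today please", spam_model, legit_model, first_n=3))
print(classify("see you at noon buy cheap pills now", spam_model, legit_model, first_n=4))
```

Because only the first `first_n` words are scored, the cost per message is bounded regardless of message length, which is one plausible reading of how such a heuristic reduces running time. The second call shows the cutoff in effect: the spam-like words after the fourth position are never examined.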