Adaptive anti-spam filtering for agglutinative languages: a special case for Turkish

Authors:
Levent Özgür;Tunga Güngör;Fikret Gürgen
Affiliations:
Department of Computer Engineering, Boğaziçi University, Istanbul 34342, Turkey;Department of Computer Engineering, Boğaziçi University, Istanbul 34342, Turkey;Department of Computer Engineering, Boğaziçi University, Istanbul 34342, Turkey
Venue:
Pattern Recognition Letters
Year:
2004

Abstract

We propose anti-spam filtering methods for agglutinative languages in general and for Turkish in particular. The methods are dynamic and are based on Artificial Neural Networks (ANN) and Bayesian Networks. The developed algorithms are user-specific and adapt themselves to the characteristics of the incoming e-mails. The algorithms have two main components: the first deals with the morphology of the words, and the second classifies the e-mails using the roots of the words extracted by the morphological analysis. Two ANN structures, single-layer perceptron and multi-layer perceptron, are considered, and the inputs to the networks are determined using a binary model and a probabilistic model. Similarly, for Bayesian classification, three different approaches are employed: a binary model, a probabilistic model, and an advanced probabilistic model. In the experiments, a total of 750 e-mails (410 spam and 340 normal) were used, and a success rate of about 90% was achieved.
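
The two-component design described in the abstract (root extraction followed by classification on the roots) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas for their binary model, so a standard Bernoulli naive Bayes with add-one smoothing is swapped in here. The suffix list and the `toy_root` function are hypothetical stand-ins for a real Turkish morphological analyzer, and the labels `"spam"` and `"normal"` are assumed names.

```python
import math
from collections import Counter

# Hypothetical suffix list; a real system would use a full Turkish
# morphological analyzer instead of this toy stripper.
SUFFIXES = ("lar", "ler", "da", "de", "in", "ın", "i", "ı")


def toy_root(word):
    """Strip at most one known suffix, keeping at least 3 characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def roots(text):
    """Binary representation: the set of roots present in a message."""
    return {toy_root(w) for w in text.lower().split()}


def train(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'normal'."""
    class_counts = Counter()
    root_counts = {"spam": Counter(), "normal": Counter()}
    for text, label in messages:
        class_counts[label] += 1
        # Count each root once per message (presence, not frequency).
        root_counts[label].update(roots(text))
    vocab = set(root_counts["spam"]) | set(root_counts["normal"])
    return class_counts, root_counts, vocab


def classify(text, model):
    """Return the label with the highest Bernoulli naive Bayes log-score."""
    class_counts, root_counts, vocab = model
    present = roots(text)
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        for r in vocab:
            # Add-one smoothed probability that root r occurs in this class.
            p = (root_counts[label][r] + 1) / (class_counts[label] + 2)
            score += math.log(p if r in present else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)


model = train([
    ("ucuz kredi kazan", "spam"),
    ("bedava kredi firsati", "spam"),
    ("toplanti yarin saat", "normal"),
    ("proje raporu toplanti", "normal"),
])
print(classify("kredi kazan", model))
print(classify("toplanti saat", model))
```

Because the model is only a table of counts, retraining it as a user labels new mail is cheap, which is one plausible way to obtain the user-specific, adaptive behaviour the abstract mentions.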