PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering

Authors:
Khurum Nazir Junejo; Asim Karim
Venue:
WI '07 Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence
Year:
2007

Abstract

The volume of spam e-mails has grown rapidly in the last two years, resulting in increasing costs to users, network operators, and e-mail service providers (ESPs). E-mail users demand accurate spam filtering with minimal effort on their part. Because the distribution of spam and non-spam e-mails often differs from user to user, a single filter trained on a general corpus is not optimal for all users. The question asked by ESPs is: How do you build robust and scalable automatic personalized spam filters? We address this question by presenting PSSF, a novel statistical approach for personalized service-side spam filtering. PSSF builds a discriminative classifier from a statistical model of spam and non-spam e-mails. The classifier is first built on a general training corpus and is then adapted to each user in one or more passes of soft labeling and classifier rebuilding over that user's unlabeled e-mails. The statistical model captures the distribution of tokens in spam and non-spam e-mails. This model is robust in the sense that its size can be reduced significantly without degrading filtering performance. We evaluate PSSF on two datasets. The results demonstrate the superior performance and scalability of PSSF in comparison with other published results on the same datasets.
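
The adaptation loop described in the abstract (train on a general corpus, soft-label the user's unlabeled e-mails, rebuild, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the statistical model, so the sketch substitutes a smoothed token-presence log-odds score, and it assumes the rebuilt model uses only the user's soft-labeled e-mails. The names `build_model`, `spam_probability`, and `personalize` are hypothetical.

```python
import math
from collections import defaultdict


def build_model(docs, spam_probs):
    """Accumulate soft per-token counts for spam and non-spam.

    docs: list of token lists; spam_probs: P(spam) for each doc, in [0, 1].
    Hard labels are the special case of probabilities 0.0 and 1.0.
    """
    spam, ham = defaultdict(float), defaultdict(float)
    for tokens, p in zip(docs, spam_probs):
        for tok in set(tokens):  # token presence, not frequency (an assumption)
            spam[tok] += p
            ham[tok] += 1.0 - p
    n_spam = sum(spam_probs)
    n_ham = len(docs) - n_spam
    return spam, ham, n_spam, n_ham


def spam_probability(model, tokens):
    """Score an e-mail with summed, add-one-smoothed token log-odds."""
    spam, ham, n_spam, n_ham = model
    score = math.log((n_spam + 1.0) / (n_ham + 1.0))  # smoothed prior odds
    for tok in set(tokens):
        p_s = (spam.get(tok, 0.0) + 1.0) / (n_spam + 2.0)
        p_h = (ham.get(tok, 0.0) + 1.0) / (n_ham + 2.0)
        score += math.log(p_s / p_h)
    score = max(-30.0, min(30.0, score))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-score))


def personalize(general_docs, general_labels, user_docs, passes=1):
    """Build a general model, then adapt it to one user's unlabeled e-mails."""
    model = build_model(general_docs, [float(y) for y in general_labels])
    for _ in range(passes):
        # Soft labeling: score the user's e-mails with the current model.
        soft = [spam_probability(model, d) for d in user_docs]
        # Rebuilding: re-estimate token statistics from the soft labels.
        model = build_model(user_docs, soft)
    return model


# Toy data, invented for illustration only.
general_docs = [["cheap", "pills", "buy"], ["buy", "cheap", "watches"],
                ["meeting", "agenda", "today"], ["project", "agenda", "review"]]
general_labels = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam
user_docs = [["cheap", "pills", "offer"], ["offer", "buy", "discount"],
             ["agenda", "project", "budget"], ["budget", "review", "meeting"]]

general_model = build_model(general_docs, [float(y) for y in general_labels])
user_model = personalize(general_docs, general_labels, user_docs, passes=2)
```

In this toy setup the tokens "offer", "discount", and "budget" never occur in the general corpus, so the general model is undecided about them. The adapted model picks up their association from the user's own mail, because they co-occur there with tokens the general model already recognizes.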