An evaluation of statistical spam filtering techniques

Authors:
Le Zhang; Jingbo Zhu; Tianshun Yao
Affiliations:
Natural Language Processing Laboratory, Institute of Computer Software & Theory, Northeastern University (all three authors)
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2004

Abstract

This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. We observe that the significance of feature selection varies greatly from classifier to classifier. In particular, we find that support vector machines, AdaBoost, and the maximum entropy model are the top performers in this evaluation, and that they share similar characteristics: they are insensitive to the feature selection strategy, scale easily to very high feature dimensions, and perform well across different datasets. In contrast, naive Bayes, a commonly used classifier in spam filtering, is sensitive to the feature selection method on small feature sets and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters for applications in which misclassifying a legitimate mail is assigned a much higher cost than misclassifying a spam (for example, λ = 999), so as to maintain better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which previous studies have often ignored. Experiments show that classifiers using features from the message header alone can achieve performance comparable to or better than that of filters using body features only. This implies that message headers can be a reliable and powerfully discriminative source of features for spam filtering.
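
The abstract refers to cost-sensitive measures, a cost factor λ, and a baseline, but does not define them. The sketch below assumes a formulation commonly used in the spam filtering literature: weighted accuracy, in which each legitimate mail counts λ times, and the total cost ratio (TCR), which compares a filter against the baseline of using no filter at all. The function names, argument names, and example counts are hypothetical and are not taken from the paper.

```python
# Assumed formulation (not confirmed by the abstract): weighted accuracy and
# total cost ratio, with a legitimate mail weighted lam times as much as a spam.

def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam):
    """Cost-weighted accuracy.

    n_ll: legitimate mails classified as legitimate
    n_ls: legitimate mails classified as spam (false positives)
    n_ss: spam mails classified as spam
    n_sl: spam mails classified as legitimate (misses)
    lam:  cost factor for a legitimate mail relative to a spam
    """
    n_legit = n_ll + n_ls
    n_spam = n_ss + n_sl
    return (lam * n_ll + n_ss) / (lam * n_legit + n_spam)


def baseline_weighted_accuracy(n_legit, n_spam, lam):
    """Weighted accuracy of the no-filter baseline: every mail is passed through."""
    return (lam * n_legit) / (lam * n_legit + n_spam)


def total_cost_ratio(n_ls, n_sl, n_spam, lam):
    """Baseline cost divided by filter cost; values above 1 beat the baseline."""
    cost = lam * n_ls + n_sl
    return float("inf") if cost == 0 else n_spam / cost


if __name__ == "__main__":
    # Hypothetical test set: 1000 legitimate mails and 500 spams, with a filter
    # that makes 1 false positive and misses 50 spams.
    for lam in (1, 9, 999):
        wacc = weighted_accuracy(999, 1, 450, 50, lam)
        base = baseline_weighted_accuracy(1000, 500, lam)
        tcr = total_cost_ratio(1, 50, 500, lam)
        print(f"lambda={lam}: WAcc={wacc:.4f} baseline={base:.4f} TCR={tcr:.3f}")
```

Under these assumptions, a single false positive at λ = 999 is enough to push this hypothetical filter below the no-filter baseline, even though the same filter is far better than the baseline at λ = 1. This illustrates why heavily penalized false positives make better-than-baseline performance hard to maintain.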