Email Spam Filtering: A Systematic Review

Authors:
Gordon V. Cormack
Affiliations:
-
Venue:
Foundations and Trends in Information Retrieval
Year:
2008

Abstract

Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam? We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements, and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results, and their limitations.
In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.