Reversing the effects of tokenisation attacks against content-based spam filters

Authors:
Igor Santos;Carlos Laorden;Borja Sanz;Pablo G. Bringas
Affiliations:
S³Lab, DeustoTech - Computing, Deusto Institute of Technology, University of Deusto, Avenida de las Universidades 24, 48007, Bilbao, Spain (all authors)
Venue:
International Journal of Security and Networks
Year:
2013

Abstract

More than 85% of received emails are spam. Many current solutions rely on machine-learning algorithms trained on statistical representations of the terms that most commonly appear in such emails. However, some attacks can subvert the filtering capabilities of these methods. Tokenisation attacks, for instance, insert characters within words so that the terms the filter relies on no longer appear intact. In this paper, we introduce a new method that reverses the effects of tokenisation attacks. Our method processes each email iteratively: starting from the first token, it builds candidate words and compares them with a common dictionary to which spam words have previously been added. We provide an empirical study of how tokenisation attacks affect the filtering capability of a Bayesian classifier, and we show that our method can reverse the effects of these attacks.
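
The iterative candidate-building step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `reverse_tokenisation`, the `max_span` limit, the greedy longest-match strategy, and the toy dictionary are all assumptions made for illustration. The abstract only states that candidates are built starting from the first token and checked against a dictionary extended with spam words.

```python
import re


def reverse_tokenisation(text, dictionary, max_span=12):
    """Merge runs of adjacent tokens whose concatenation is a dictionary word.

    Hypothetical sketch: greedy longest-match is an assumed strategy,
    not necessarily the one used in the paper.
    """
    # Treat every non-alphanumeric run as a separator, so inserted
    # characters such as '.', '-', '_' or spaces become token boundaries.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    result = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first; a candidate always joins
        # at least two tokens and at most max_span tokens.
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            candidate = "".join(tokens[i:j]).lower()
            if candidate in dictionary:
                match = (candidate, j)
                break
        if match:
            result.append(match[0])  # restored word
            i = match[1]             # skip the tokens that were merged
        else:
            result.append(tokens[i])  # leave the token unchanged
            i += 1
    return result


# Toy dictionary: common words plus previously added spam words (assumed).
words = {"buy", "cheap", "viagra"}
print(reverse_tokenisation("Buy v.i.a.g.r.a ch-eap now", words))
# → ['Buy', 'viagra', 'cheap', 'now']
```

A greedy merge of this kind can also join legitimate neighbouring words whose concatenation happens to be in the dictionary, so a practical system would need additional checks beyond this sketch.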