More than 85% of received emails are spam. Many current solutions rely on machine-learning algorithms trained on statistical representations of the terms that most commonly appear in such emails. However, some attacks can subvert the filtering capabilities of these methods. Tokenisation attacks, for instance, insert characters within words so that the filter no longer sees the terms it was trained on. In this paper, we introduce a new method that reverses the effects of tokenisation attacks. Our method processes emails iteratively: starting from the first token, it considers possible words and compares these word candidates with a common dictionary to which spam words have previously been added. We provide an empirical study of how tokenisation attacks affect the filtering capability of a Bayesian classifier, and we show that our method can reverse the effects of these attacks.
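The abstract does not give the details of the procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that word candidates are formed by concatenating consecutive tokens, that the longest dictionary match wins, and that a candidate spans at most `max_span` tokens. The function name `reverse_tokenisation`, the `max_span` parameter, and the toy dictionary are all hypothetical.

```python
import re


def reverse_tokenisation(text, dictionary, max_span=8):
    """Greedily rejoin split tokens into dictionary words.

    Assumption (not from the paper): candidates are built by concatenating
    consecutive tokens, and the longest dictionary match is preferred.
    """
    # Split on anything that is not a letter or digit, so inserted
    # characters such as '.', '-' or spaces act as token boundaries.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    restored = []
    i = 0
    while i < len(tokens):
        candidate = ""
        best_word = None
        best_end = None
        # Grow a candidate from token i, one token at a time.
        for j in range(i, min(i + max_span, len(tokens))):
            candidate += tokens[j]
            # Only multi-token candidates count as a repaired word.
            if j > i and candidate in dictionary:
                best_word, best_end = candidate, j
        if best_word is not None:
            restored.append(best_word)
            i = best_end + 1
        else:
            restored.append(tokens[i])
            i += 1
    return restored


# Toy dictionary: common words plus previously added spam words (hypothetical).
dictionary = {"buy", "cheap", "now", "viagra"}
print(reverse_tokenisation("buy v.i.a.g.r.a ch-eap now", dictionary))
# → ['buy', 'viagra', 'cheap', 'now']
```

The restored token list could then be passed to the classifier in place of the attacked text. Preferring the longest match is a design choice of this sketch: it avoids stopping at a short dictionary word that happens to be a prefix of the intended one.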