Spam Filtering Using Statistical Data Compression Models

Authors:
Andrej Bratko;Bogdan Filipič;Gordon V. Cormack;Thomas R. Lynam;Blaž Zupan
Venue:
The Journal of Machine Learning Research
Year:
2006

Abstract

Spam filtering poses a special problem in text categorization: its defining characteristic is that filters face an active adversary who constantly attempts to evade them. Since spam evolves continuously and most practical applications rely on online user feedback, the task calls for fast, incremental, and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers that operate on character-level or binary sequences. Because messages are modeled as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, which makes the method very robust. The models are also fast to construct and incrementally updatable. We evaluate the filtering performance of two compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
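
The general idea of using a compression model as a classifier can be illustrated with a minimal sketch. It is not the paper's implementation: instead of dynamic Markov compression or prediction by partial matching, it assumes a much simpler fixed-order character model with add-one smoothing. The names `CharModel` and `CompressionFilter`, the default order of 2, and the 256-symbol alphabet are all hypothetical choices made for illustration. One model is trained per class, and a message is assigned to the class whose model would encode it in the fewest bits.

```python
import math
from collections import defaultdict


class CharModel:
    """Fixed-order character model with add-one smoothing.

    A simplified stand-in for a compression model; NOT PPM or DMC.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        # Assumed alphabet size, used only for smoothing.
        self.alphabet_size = alphabet_size
        # counts[context][symbol] = times `symbol` followed `context`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _pairs(self, text):
        # Pad the start so the first characters also have a context.
        padded = "\0" * self.order + text
        for i in range(self.order, len(padded)):
            yield padded[i - self.order:i], padded[i]

    def update(self, text):
        """Incrementally add one message to the model."""
        for context, symbol in self._pairs(text):
            self.counts[context][symbol] += 1
            self.totals[context] += 1

    def code_length(self, text):
        """Bits an ideal coder would need for `text` under this model."""
        bits = 0.0
        for context, symbol in self._pairs(text):
            seen = self.counts.get(context, {}).get(symbol, 0)
            total = self.totals.get(context, 0)
            p = (seen + 1) / (total + self.alphabet_size)
            bits -= math.log2(p)
        return bits


class CompressionFilter:
    """Assigns a message to the class whose model compresses it best."""

    def __init__(self, order=2):
        self.models = {"spam": CharModel(order), "ham": CharModel(order)}

    def train(self, text, label):
        self.models[label].update(text)

    def classify(self, text):
        return min(self.models,
                   key=lambda label: self.models[label].code_length(text))


if __name__ == "__main__":
    f = CompressionFilter()
    f.train("buy cheap pills now", "spam")
    f.train("see you at the meeting", "ham")
    print(f.classify("cheap pills"))
```

Because the sketch works directly on characters, it needs no tokenizer, and each new labeled message is absorbed with a single `update` call. The model here stays fixed while a message is scored; an adaptive variant would also update its statistics during scoring.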