Exploiting redundancy in natural language to penetrate Bayesian spam filters

Authors:
Christoph Karlberger; Günther Bayler; Christopher Kruegel; Engin Kirda
Affiliations:
Secure Systems Lab., Technical University Vienna; Secure Systems Lab., Technical University Vienna; Secure Systems Lab., Technical University Vienna; Secure Systems Lab., Technical University Vienna
Venue:
WOOT '07 Proceedings of the first USENIX workshop on Offensive Technologies
Year:
2007



Abstract

Today's attacks against Bayesian spam filters attempt to keep the content of spam mails visible to humans but obscured from filters. A common technique is to fool filters by appending additional words to a spam mail. Because these words appear very rarely in spam mails, filters are inclined to classify the mail as legitimate. The idea we present in this paper leverages the fact that natural language typically contains synonyms, that is, different words that describe similar terms and concepts. Such words often have significantly different spam probabilities. Thus, an attacker might be able to penetrate Bayesian filters by replacing suspicious words with innocuous terms that have the same meaning. A precondition for the success of such an attack is that the Bayesian spam filters of different users assign similar spam probabilities to similar tokens. We first examine whether this precondition is met; afterwards, we measure the effectiveness of an automated substitution attack by creating a test set of spam messages and testing them against SpamAssassin, DSPAM, and Gmail.
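
The substitution idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation. The token probabilities, the synonym table, the default probability for unseen tokens, and the function names are all hypothetical, and the score is combined with a simple naive-Bayes product rule as an assumption; SpamAssassin, DSPAM, and Gmail may combine token evidence differently.

```python
from math import prod

# Hypothetical per-token spam probabilities, standing in for what a
# trained Bayesian filter might have learned. Values are made up.
SPAM_PROB = {
    "buy": 0.80, "purchase": 0.50,
    "cheap": 0.95, "inexpensive": 0.40, "affordable": 0.55,
    "pills": 0.98, "tablets": 0.60,
}

# Hypothetical synonym table. A realistic attack would need a thesaurus
# and some way of picking synonyms that fit the word's sense in context.
SYNONYMS = {
    "buy": ["purchase"],
    "cheap": ["inexpensive", "affordable"],
    "pills": ["tablets"],
}

UNKNOWN_PROB = 0.4  # assumed probability for tokens the filter has not seen


def token_prob(token):
    """Spam probability the (hypothetical) filter assigns to one token."""
    return SPAM_PROB.get(token, UNKNOWN_PROB)


def combined_spam_prob(tokens):
    """Combine token probabilities as prod(p) / (prod(p) + prod(1 - p))."""
    probs = [token_prob(t) for t in tokens]
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def substitute(tokens):
    """Replace each token by whichever of itself or its synonyms scores lowest."""
    rewritten = []
    for t in tokens:
        candidates = [t] + SYNONYMS.get(t, [])
        rewritten.append(min(candidates, key=token_prob))
    return rewritten


original = ["buy", "cheap", "pills"]
rewritten = substitute(original)
print(rewritten, combined_spam_prob(original), combined_spam_prob(rewritten))
```

With these made-up numbers the rewritten message scores lower than the original. Whether that holds against a real filter depends on the precondition stated in the abstract: the attacker's estimates of token probabilities must resemble those learned by the victim's filter.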