Better Naive Bayes classification for high-precision spam detection

Authors:
Yang Song; Aleksander Kołcz; C. Lee Giles
Affiliations:
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, U.S.A.; Microsoft Live Labs, One Microsoft Way, Redmond, WA 98052, U.S.A.; College of Information Science and Technology, The Pennsylvania State University, University Park, PA 16802, U.S.A.
Venue:
Software—Practice & Experience
Year:
2009

Citing 0
Cited 4

A survey of emerging approaches to spam filtering
ACM Computing Surveys (CSUR)

Comment spam detection by sequence mining
Proceedings of the fifth ACM international conference on Web search and data mining

Using probabilistic generative models for ranking risks of Android apps
Proceedings of the 2012 ACM conference on Computer and communications security

RssE-Miner: a new approach for efficient events mining from social media RSS feeds
DaWaK'12 Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery

Abstract

Email spam has become a major problem for Internet users and providers. One major obstacle to its eradication is that potential solutions must ensure a very low false-positive rate (FPR), which tends to be difficult in practice. We address the problem of low-FPR classification in the context of naive Bayes, one of the most popular machine learning models applied in the spam filtering domain. Drawing on recent extensions, we propose a new term weight aggregation function, which leads to markedly better results than the standard alternatives. We identify short instances as ones with disproportionately poor performance and counter this behavior with a collaborative filtering-based feature augmentation. Finally, we propose a tree-based classifier cascade for which the decision thresholds of the leaf nodes are jointly optimized for the best overall performance. These improvements, both individually and in aggregate, lead to a substantially better detection rate at high precision when compared with some of the best variants of naive Bayes proposed to date.

Copyright © 2009 John Wiley & Sons, Ltd.

This work was done when the first author was an intern at Microsoft Live Labs Research.
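To make the low-FPR setting in the abstract concrete, the following is a minimal sketch of a standard multinomial naive Bayes scorer with a decision threshold chosen to respect a false-positive budget on held-out legitimate mail. It is not the term weight aggregation function, feature augmentation, or classifier cascade proposed in the paper, none of which are specified in this record. The function names (`train_nb`, `score`, `threshold_for_fpr`), the Laplace smoothing constant `alpha`, and the token-list input format are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model (label 1 = spam, 0 = legitimate).

    Returns per-term log-odds weights and the prior log-odds bias.
    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    # Smoothed total token mass per class.
    totals = {y: sum(counts[y].values()) + alpha * len(vocab) for y in (0, 1)}
    weights = {
        t: math.log((counts[1][t] + alpha) / totals[1])
        - math.log((counts[0][t] + alpha) / totals[0])
        for t in vocab
    }
    bias = math.log(n_docs[1] / n_docs[0])
    return weights, bias


def score(tokens, weights, bias):
    """Sum the log-odds weights of the tokens; unseen tokens contribute 0."""
    return bias + sum(weights.get(t, 0.0) for t in tokens)


def threshold_for_fpr(ham_scores, max_fpr):
    """Pick a threshold so that at most max_fpr of the held-out legitimate
    messages score strictly above it (and would be flagged as spam)."""
    ranked = sorted(ham_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # legitimate messages we may misflag
    return ranked[allowed] if allowed < len(ranked) else float("-inf")
```

A message would then be flagged as spam only when `score(tokens, weights, bias)` is strictly greater than the chosen threshold. Tightening `max_fpr` raises the threshold, trading detection rate for fewer false positives, which is the trade-off the abstract targets.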