A limited memory algorithm for bound constrained optimization
SIAM Journal on Scientific Computing
Understanding the network-level behavior of spammers
Proceedings of the 2006 conference on Applications, technologies, architectures, and protocols for computer communications
Spam Filtering Using Statistical Data Compression Models
The Journal of Machine Learning Research
Relaxed online SVMs for spam filtering
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Filtering spam with behavioral blacklisting
Proceedings of the 14th ACM conference on Computer and communications security
Characterizing botnets from email spam records
LEET'08 Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats
Spamming botnets: signatures and characteristics
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
Ranking Comments on the Social Web
CSE '09 Proceedings of the 2009 International Conference on Computational Science and Engineering - Volume 04
Uncovering social spammers: social honeypots + machine learning
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
The Journal of Machine Learning Research
Detecting and characterizing social spam campaigns
IMC '10 Proceedings of the 10th ACM SIGCOMM conference on Internet measurement
User reputation in a comment rating environment
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Semi-supervised correction of biased comment ratings
Proceedings of the 21st international conference on World Wide Web
The minimum description length principle in coding and modeling
IEEE Transactions on Information Theory
Approaches to adversarial drift
Proceedings of the 2013 ACM workshop on Artificial intelligence and security
Hi-index | 0.00 |
In this work, we design a method for blog comment spam detection using the assumption that spam is any kind of uninformative content. To measure the "informativeness" of a set of blog comments, we construct a language and tokenization independent metric which we call content complexity, providing a normalized answer to the informal question "how much information does this text contain?" We leverage this metric to create a small set of features well-adjusted to comment spam detection by computing the content complexity over groupings of messages sharing the same author, the same sender IP, the same included links, etc. We evaluate our method against an exact set of tens of millions of comments collected over a four months period and containing a variety of websites, including blogs and news sites. The data was provided to us with an initial spam labeling from an industry competitive source. Nevertheless the initial spam labeling had unknown performance characteristics. To train a logistic regression on this dataset using our features, we derive a simple mislabeling tolerant logistic regression algorithm based on expectation-maximization, which we show generally outperforms the plain version in precision-recall space. By using a parsimonious hand-labeling strategy, we show that our method can operate at an arbitrary high precision level, and that it significantly dominates, both in terms of precision and recall, the original labeling, despite being trained on it alone. The content complexity metric, the use of a noise-tolerant logistic regression and the evaluation methodology are thus the three central contributions with this work.