Comment spam detection by sequence mining

Authors:
Ravi Kant;Srinivasan H. Sengamedu;Krishnan S. Kumar
Affiliations:
Yahoo Labs, Bangalore, India;Komli Labs, Bangalore, India;Yahoo Labs, Bangalore, India
Venue:
Proceedings of the fifth ACM international conference on Web search and data mining
Year:
2012

Abstract

Comments are supported by several web sites to increase user participation. Users can usually comment on a variety of media types: photos, videos, news articles, blogs, etc. Comment spam is one of the biggest challenges facing this feature. The traditional approach to combating spam is to train classifiers using various machine learning techniques. Since the commonly used classifiers work on the entire comment text, it is easy to mislead them by embedding spam content in good content. In this paper, we make several contributions towards comment spam detection. (1) We propose a new framework for spam detection that is immune to such embedding attacks: we characterize spam by a set of frequently occurring sequential patterns. (2) We introduce a variant (called min-closed) of the frequent closed sequence mining problem that succinctly captures all the frequently occurring patterns. We prove, and show experimentally, that the set of min-closed sequences is an order of magnitude smaller than the set of closed sequences and yet has exactly the same coverage. (3) We describe MCPRISM, an extension of the recently published PRISM algorithm that effectively mines min-closed sequences using prime encoding. In the process, we solve the open problem of using the prime-encoding technique to speed up traditional closed sequence mining. (4) Finally, we whittle down the set of frequent subsequences to a small set without sacrificing coverage. This problem is NP-hard, but we show that the coverage function is submodular, and hence the greedy heuristic gives a fast algorithm that is close to optimal.

We then describe experiments carried out on a large real-world comment dataset and the publicly available Gazelle dataset. (1) We show that nearly 80% of spam on real-world data can be effectively captured by the mined sequences at very low false positive rates. (2) The sequences mined are highly discriminative. (3) On Gazelle data, the proposed algorithmic enhancements are faster by at least a factor, and by an order of magnitude on the larger comment dataset.
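
The pattern-selection step in contribution (4) can be illustrated with a minimal sketch. It is not MCPRISM and not the authors' implementation. It is only the textbook greedy heuristic for maximum coverage, applied to sequential patterns that are assumed to have been mined already. The function names (`is_subsequence`, `greedy_select`), the budget `k`, and the toy comments and patterns are all hypothetical and chosen for illustration. For a monotone submodular coverage function under a cardinality budget, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal coverage, which is one standard reading of "close to optimal".

```python
from typing import List, Sequence, Set, Tuple


def is_subsequence(pattern: Sequence[str], tokens: Sequence[str]) -> bool:
    """Return True if the pattern's items occur in tokens in order (gaps allowed)."""
    it = iter(tokens)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)


def greedy_select(
    patterns: List[Tuple[str, ...]],
    comments: List[List[str]],
    k: int,
) -> Tuple[List[Tuple[str, ...]], Set[int]]:
    """Pick up to k patterns, each time taking the largest marginal coverage gain."""
    # For each pattern, the set of comment indices it matches.
    covers = [
        {i for i, c in enumerate(comments) if is_subsequence(p, c)}
        for p in patterns
    ]
    chosen: List[Tuple[str, ...]] = []
    covered: Set[int] = set()
    for _ in range(min(k, len(patterns))):
        best = max(range(len(patterns)), key=lambda j: len(covers[j] - covered))
        if not covers[best] - covered:
            break  # no remaining pattern adds coverage
        chosen.append(patterns[best])
        covered |= covers[best]
    return chosen, covered


# Toy data (hypothetical): the second and fourth comments embed spam in benign text.
comments = [
    "buy cheap pills online now".split(),
    "great photo buy cheap watches here".split(),
    "click here for free gift".split(),
    "nice post click this link here".split(),
]
patterns = [("buy", "cheap"), ("click", "here"), ("cheap", "pills"), ("free", "gift")]

chosen, covered = greedy_select(patterns, comments, k=2)
```

Because matching is done on subsequences rather than on the whole comment, surrounding benign text does not hide a pattern, which is the intuition behind resistance to embedding attacks. A real system would also need to check each pattern against non-spam comments to control false positives, which this sketch omits.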