Approximation algorithms for NP-hard problems
FreeSpan: frequent pattern-projected sequential pattern mining
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
SPADE: an efficient algorithm for mining frequent sequences
Machine Learning
Mining Sequential Patterns: Generalizations and Performance Improvements
EDBT '96 Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology
ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth
Proceedings of the 17th International Conference on Data Engineering
The PSP Approach for Mining Sequential Patterns
PKDD '98 Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery
Sequential PAttern mining using a bitmap representation
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining Top.K Frequent Closed Patterns without Minimum Support
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
BIDE: Efficient Mining of Frequent Closed Sequences
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Proceedings of the 16th international conference on World Wide Web
Latent dirichlet allocation in web spam filtering
AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Email Spam Filtering: A Systematic Review
Foundations and Trends in Information Retrieval
Web spam identification through language model analysis
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Better Naive Bayes classification for high-precision spam detection
Software—Practice & Experience
On the relative age of spam and ham training samples for email filtering
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
A co-classification framework for detecting web spam and spammers in social media web sites
Proceedings of the 18th ACM conference on Information and knowledge management
Ensembles in adversarial classification for spam
Proceedings of the 18th ACM conference on Information and knowledge management
Comment spam injection made easy
CCNC'09 Proceedings of the 6th IEEE Conference on Consumer Communications and Networking Conference
CAPTCHA: using hard AI problems for security
EUROCRYPT'03 Proceedings of the 22nd international conference on Theory and applications of cryptographic techniques
Topic-driven reader comments summarization
Proceedings of the 21st ACM international conference on Information and knowledge management
Mind the gap: large-scale frequent sequence mining
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Going beyond Corr-LDA for detecting specific comments on news & blogs
Proceedings of the 7th ACM international conference on Web search and data mining
Hi-index | 0.00 |
Comments are supported by several web sites to increase user participation. Users can usually comment on a variety of media types - photos, videos, news articles, blogs, etc. Comment spam is one of the biggest challenges facing this feature. The traditional approach to combat spam is to train classifiers using various machine learning techniques. Since the commonly used classifiers work on the entire comment text, it is easy to mislead them by embedding spam content in good content. In this paper, we make several contributions towards comment spam detection. (1) We propose a new framework for spam detection that is immune to embed attacks. We characterize spam by a set of frequently occurring sequential patterns. (2) We introduce a variant (called min-closed) of the frequent closed sequence mining problem that succinctly captures all the frequently occurring patterns. We prove as well as experimentally show that the set of min-closed sequences is an order of magnitude smaller than the set of closed sequences and yet has exactly the same coverage. (3) We describe MCPRISM, extension of the recently published PRISM algorithm that effectively mines min-closed sequences, using prime encoding. In the process, we solve the open problem of using the prime-encoding technique to speed up traditional closed sequence mining. (4) We finally need to whittle down the set of frequent subsequences to a small set without sacrificing coverage. This problem is NP-Hard but we show that the coverage function is submodular and hence the greedy heuristic gives a fast algorithm that is close to optimal. We then describe the experiments that were carried out on a large real world comment data and the publicly available Gazelle dataset. (1) We show that nearly 80% of spam on real world data can be effectively captured by the mined sequences at very low false positive rates. (2) The sequences mined are highly discriminative. (3) On Gazelle data, the proposed algorithmic enhancements are faster by at least by a factor and by an order of magnitude on the larger comment dataset.