Effectively Detecting Content Spam on the Web Using Topical Diversity Measures

Authors:
Cailing Dong;Bin Zhou
Venue:
WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2012

Abstract

Recent studies on web spam detection have used various content-based and link-based features to build spam classification models. In this paper, we conduct a thorough analysis of content spam on the web using topic models and propose several novel topical diversity measures for content spam detection. We adopt the web spam benchmark data set WEBSPAM-UK2007 for evaluation, and the experimental results verify that integrating our topical diversity measures greatly improves the performance of state-of-the-art web spam detection methods. In addition, compared with existing features for training spam classification models, our topical diversity measures achieve high spam detection performance using a small set of training data. In personalized web spam detection, the training data (i.e., a user's spam labeling results) are typically limited, so this finding makes personalized web spam detection highly achievable. We develop an efficient and effective regression model using topical diversity measures for personalized web spam detection, and present some promising results obtained from an empirical study.