Boosting the Performance of Web Spam Detection with Ensemble Under-Sampling Classification

Authors:
Guang-Gang Geng;Chun-Heng Wang;Qiu-Dan Li;Lei Xu;Xiao-Bo Jin
Affiliations:
Chinese Academy of Sciences;Chinese Academy of Sciences;Chinese Academy of Sciences;Chinese Academy of Sciences;Chinese Academy of Sciences
Venue:
FSKD '07 Proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 04
Year:
2007

Abstract

Combating spam has become one of the top challenges for Web search. In this paper, we treat web spam detection as a binary classification problem. Because reputable pages are easier to obtain than spam pages on the Web, we adopt an ensemble under-sampling classification strategy that takes full advantage of the information contained in the large number of reputable websites. The strategy is based on the spamicity predicted by each sub-classifier, with both content-based and link-based features taken into account. Experiments on the standard WEBSPAM-UK2006 benchmark show that the ensemble strategy effectively improves web spam detection performance.
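
The general idea of ensemble under-sampling can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the base learner or the rule for combining sub-classifier outputs, so the tiny logistic regression used as a sub-classifier, the averaging of spamicity scores, and all function names and parameters (`train_logistic`, `train_ensemble`, `ensemble_spamicity`, `n_models`) are assumptions made for illustration. Each sub-classifier sees all spam examples plus a different random subset of reputable examples of the same size, so the large reputable set is used across the ensemble while each training set stays balanced.

```python
import math
import random


def _sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=200, lr=0.5):
    """Fit a small logistic regression by batch gradient descent.

    Assumption: this is only a stand-in sub-classifier; the abstract
    does not say which base learner the paper uses.
    """
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def spamicity(model, x):
    """Predicted probability that feature vector x belongs to a spam page."""
    w, b = model
    return _sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_ensemble(spam, normal, n_models=5, seed=0):
    """Train n_models sub-classifiers, each on all spam examples plus a
    random under-sample of the (much larger) reputable set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        subset = rng.sample(normal, len(spam))  # balanced under-sample
        X = spam + subset
        y = [1] * len(spam) + [0] * len(subset)
        models.append(train_logistic(X, y))
    return models


def ensemble_spamicity(models, x):
    """Combine sub-classifiers by averaging their spamicity scores
    (an assumed combination rule, not taken from the paper)."""
    return sum(spamicity(m, x) for m in models) / len(models)


if __name__ == "__main__":
    # Synthetic two-feature data: one hypothetical content-based score
    # and one hypothetical link-based score per page.
    rng = random.Random(1)
    spam = [[rng.gauss(1, 0.3), rng.gauss(1, 0.3)] for _ in range(20)]
    normal = [[rng.gauss(-1, 0.3), rng.gauss(-1, 0.3)] for _ in range(200)]
    models = train_ensemble(spam, normal)
    print(ensemble_spamicity(models, [1.0, 1.0]))
    print(ensemble_spamicity(models, [-1.0, -1.0]))
```

In this sketch a page would be flagged as spam when the averaged spamicity exceeds a chosen threshold such as 0.5; the threshold, like the synthetic data, is illustrative only.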