Boosting the Performance of Web Spam Detection with Ensemble Under-Sampling Classification

  • Authors:
  • Guang-Gang Geng;Chun-Heng Wang;Qiu-Dan Li;Lei Xu;Xiao-Bo Jin

  • Affiliations:
  • Chinese Academy of Sciences;Chinese Academy of Sciences;Chinese Academy of Sciences;Chinese Academy of Sciences;Chinese Academy of Sciences

  • Venue:
  • FSKD '07 Proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 04
  • Year:
  • 2007

Abstract

Combating spam has become one of the top challenges for Web search. In this paper, we treat web spam detection as a binary classification problem. Because reputable pages are easier to obtain than spam pages on the Web, we adopt an ensemble under-sampling classification strategy that takes full advantage of the information contained in the large number of reputable websites. The strategy is based on the predicted spamicity of each sub-classifier, in which both content-based and link-based features are taken into account. Experiments on the standard WEBSPAM-UK2006 benchmark show that the ensemble strategy effectively improves web spam detection performance.
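
The general idea of ensemble under-sampling can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which base learner is used or how the sub-classifier outputs are combined, so the centroid-based scorer, the averaging rule, and all function names below are hypothetical. Each sub-classifier is trained on every spam sample plus a different random subset of the reputable samples, so the large reputable class is used across the ensemble without any single training set being imbalanced.

```python
import math
import random


def train_centroid_scorer(spam, normal):
    """Hypothetical base learner (an assumption, not from the paper).

    Returns a function mapping a feature vector to a spamicity in [0, 1],
    based on relative distance to the spam and reputable class centroids.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_spam, c_norm = centroid(spam), centroid(normal)

    def spamicity(x):
        d_spam = math.dist(x, c_spam)
        d_norm = math.dist(x, c_norm)
        total = d_spam + d_norm
        # Closer to the spam centroid means a higher spamicity.
        return 0.5 if total == 0 else d_norm / total

    return spamicity


def train_ensemble(spam, normal, n_classifiers=5, seed=0):
    """Train sub-classifiers on all spam plus an under-sampled reputable subset."""
    rng = random.Random(seed)
    k = min(len(spam), len(normal))  # balance the two classes in each subset
    members = []
    for _ in range(n_classifiers):
        subset = rng.sample(normal, k)  # a different reputable subset each time
        members.append(train_centroid_scorer(spam, subset))
    return members


def predict_spamicity(members, x):
    """Average the sub-classifiers' spamicity (assumed combination rule)."""
    return sum(m(x) for m in members) / len(members)


# Toy data: each vector is [content_feature, link_feature]; values are made up.
spam = [[0.9, 0.8], [0.8, 0.9]]
normal = [[0.1 + 0.01 * i, 0.2 - 0.005 * i] for i in range(20)]

members = train_ensemble(spam, normal)
spam_like_score = predict_spamicity(members, [0.85, 0.85])
reputable_like_score = predict_spamicity(members, [0.1, 0.1])
```

A host would then be labelled spam when its averaged spamicity exceeds a chosen threshold such as 0.5. The threshold and the two-dimensional feature vectors here are placeholders for the content-based and link-based features mentioned in the abstract.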