A large-scale study of link spam detection by graph algorithms

  • Authors:
  • Hiroo Saito; Masashi Toyoda; Masaru Kitsuregawa; Kazuyuki Aihara

  • Affiliations:
  • Aihara Complexity Modelling Project, ERATO, JST, Tokyo, Japan and University of Tokyo, Tokyo, Japan; University of Tokyo, Tokyo, Japan; University of Tokyo, Tokyo, Japan; Aihara Complexity Modelling Project, ERATO, JST, Tokyo, Japan and University of Tokyo, Tokyo, Japan

  • Venue:
  • AIRWeb '07: Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web
  • Year:
  • 2007

Abstract

Link spam refers to attempts to promote the ranking of spammers' web sites by deceiving the link-based ranking algorithms of search engines. Spammers often create densely connected link structures among sites, so-called "link farms". In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCCs). Besides the largest SCC (the core) at the center of the web, we observed that most of the large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, using these link farms as a reliable spam seed set, we expand them by a minimum cut technique that cuts the links between spam and non-spam sites. We found about 0.6 million spam sites in SCCs around the core, and extracted an additional 8 thousand and 49 thousand sites as spam with high precision in the core by the maximal clique enumeration and by the minimum cut technique, respectively.
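
The first two steps of the pipeline, SCC decomposition and maximal clique enumeration, can be illustrated on a toy graph. The following is a minimal sketch and not the authors' implementation: the abstract does not name specific algorithms, so Kosaraju's algorithm and Bron-Kerbosch with pivoting are used here as standard choices. The function names and the toy graph are hypothetical, and treating a clique as a set of sites that all link to each other in both directions is an assumption made for this example. The minimum cut expansion step is not shown.

```python
from collections import defaultdict


def strongly_connected_components(graph):
    """Kosaraju's algorithm, iterative. graph: dict node -> set of successors."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    reverse = defaultdict(set)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].add(u)

    # Pass 1: record nodes in order of DFS finishing time.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finishing time.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def mutual_link_graph(graph, nodes):
    """Undirected graph on `nodes` with an edge u-v when u->v and v->u both exist.

    Assumption for this sketch: cliques are sought among mutual links only.
    """
    adj = {u: set() for u in nodes}
    for u in nodes:
        for v in graph.get(u, ()):
            if v != u and v in nodes and u in graph.get(v, ()):
                adj[u].add(v)
                adj[v].add(u)
    return adj


def maximal_cliques(adj, min_size=3):
    """Bron-Kerbosch with pivoting. adj: dict node -> set of neighbours."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                cliques.append(r)
            return
        # Pivot on the node with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph: a->b->c->a is a plain cycle, d/e/f link to each other mutually.
toy = {
    "a": {"b"}, "b": {"c"}, "c": {"a", "d"},
    "d": {"e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
    "g": {"a"},
}
components = strongly_connected_components(toy)
farm_seeds = maximal_cliques(mutual_link_graph(toy, set(toy)))
```

On this toy input the decomposition yields three components (`a, b, c`, then `d, e, f`, then `g` alone), and the only maximal clique of size three or more is `d, e, f`. At the scale reported in the abstract, an in-memory dictionary representation like this one would likely need to be replaced by a more compact graph store.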