Extracting link spam using biased random walks from spam seed sets

  • Authors:
  • Baoning Wu; Kumar Chellapilla

  • Affiliations:
  • Lehigh University, Bethlehem, PA; Microsoft Live Labs, Redmond, WA

  • Venue:
  • AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
  • Year:
  • 2007

Abstract

Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link-based ranking algorithms such as PageRank, HITS, and their derivatives are especially vulnerable to link spam. Link farms and link exchanges are two common instances of link spam that produce spam communities, i.e., clusters in the web graph. In this paper, we present a directed approach to extracting link spam communities when given one or more members of the community. In contrast to previous completely automated approaches to finding link spam, our method is specifically designed to be used interactively. Our approach starts with a small spam seed set provided by the user and simulates a random walk on the web graph. The random walk is biased to explore the local neighborhood around the seed set through the use of decay probabilities. Truncation is used to retain only the most frequently visited nodes. After termination, the nodes are sorted in decreasing order of their final probabilities and presented to the user. Experiments using manually labeled link spam data sets and random walks from a single seed domain show that the approach achieves over 95.12% precision in extracting large link farms and 80.46% precision in extracting link exchange centroids.
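
The abstract outlines the procedure only at a high level (seed set, decayed random walk, truncation, final sort), so the following is a minimal sketch of one possible reading, not the authors' implementation. The function name `biased_walk`, the parameters `decay`, `steps`, and `keep`, the uniform split of probability mass over out-links, and the choice to accumulate mass across steps as the final score are all assumptions made for illustration.

```python
from collections import defaultdict


def biased_walk(graph, seeds, decay=0.5, steps=4, keep=50):
    """Rank nodes near `seeds` by accumulated, decayed walk probability.

    graph: dict mapping a node to a list of the nodes it links to.
    decay: fraction of mass that survives each step (assumed form of
           the "decay probabilities"); smaller values keep the walk local.
    keep:  number of nodes retained after each step (assumed form of
           the truncation).
    """
    seeds = list(seeds)
    # Start with the probability mass spread uniformly over the seed set.
    frontier = {s: 1.0 / len(seeds) for s in seeds}
    score = defaultdict(float)
    for node, mass in frontier.items():
        score[node] += mass

    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in frontier.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                continue  # dangling node: its mass is simply dropped
            share = decay * mass / len(neighbors)
            for n in neighbors:
                nxt[n] += share
        # Truncation: keep only the `keep` highest-mass nodes.
        # Ties are broken by node name so the result is deterministic.
        top = sorted(nxt.items(), key=lambda kv: (-kv[1], kv[0]))[:keep]
        frontier = dict(top)
        for node, mass in frontier.items():
            score[node] += mass

    # Present nodes in decreasing order of their final score.
    return sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy graph: a tight three-node cluster plus one outside page
# that links in but is never linked to.
toy = {
    "seed": ["a", "b"],
    "a": ["seed", "b"],
    "b": ["seed", "a"],
    "far": ["seed"],
}
ranked = biased_walk(toy, ["seed"], decay=0.5, steps=2, keep=10)
```

In this toy graph the walk never reaches `far`, because no node in the cluster links to it, so only the cluster around the seed appears in `ranked`. In an interactive setting, the user would inspect the head of the list and could add confirmed spam nodes to the seed set for another run.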