Robust PageRank and locally computable spam detection features

  • Authors:
  • Reid Andersen;Christian Borgs;Jennifer Chayes;John Hopcroft;Kamal Jain;Vahab Mirrokni;Shanghua Teng

  • Affiliations:
  • Microsoft Research, Redmond;Microsoft Research, Redmond;Microsoft Research, Redmond;Cornell University;Microsoft Research, Redmond;Microsoft Research, Redmond;Boston University

  • Venue:
  • AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
  • Year:
  • 2008

Abstract

Because the link structure of the web is an important element of search-engine ranking systems, web spammers widely exploit that link structure to increase the rank of their pages. Various link-based features of web pages have been introduced and have proven effective at identifying link spam. One particularly successful family of features (as described in the SpamRank algorithm) is based on examining the sets of pages that contribute most to the PageRank of a given vertex, called supporting sets. In a recent paper, the current authors described an algorithm for efficiently computing, for a single specified vertex, an approximation of its supporting sets. In this paper, we describe several link-based spam-detection features, both supervised and unsupervised, that can be derived from these approximate supporting sets. As unsupervised features, we examine the size of a node's supporting sets and the approximate ℓ2 norm of the PageRank contributions from other nodes. As a supervised feature, we examine the composition of a node's supporting sets. We perform experiments on two labeled real data sets to demonstrate the effectiveness of these features for spam detection, and show that these features can be computed efficiently. Furthermore, we design a variation of PageRank (called Robust PageRank) that incorporates some of these features into its ranking, argue that this variation is more robust against link spam engineering, and give an algorithm for approximating Robust PageRank.
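
To make the notions of PageRank contributions, supporting sets, and the ℓ2 norm of contributions concrete, here is a minimal sketch on a toy graph. It is not the authors' local approximation algorithm: it computes contributions for the whole graph by plain fixed-point iteration. The function names, the teleport probability `alpha`, the threshold `delta`, the definition of a supporting set as the nodes contributing more than a `delta` fraction of the total, and the normalization of the ℓ2 norm are all assumptions made for illustration.

```python
from math import sqrt

def contributions(out_links, target, alpha=0.15, iters=200):
    """Contribution c[u] of each node u to the PageRank of `target`.

    Assumed model: c[u] = alpha * [u == target]
                          + (1 - alpha) * mean(c[w] for w in out_links[u]),
    solved by fixed-point iteration. Every linked node must be a key of
    out_links; a node with no out-links keeps only its own teleport term.
    """
    c = {u: 0.0 for u in out_links}
    for _ in range(iters):
        nxt = {}
        for u, succ in out_links.items():
            avg = sum(c[w] for w in succ) / len(succ) if succ else 0.0
            nxt[u] = (alpha if u == target else 0.0) + (1 - alpha) * avg
        c = nxt
    return c

def support_features(c, delta):
    """Return (supporting set, normalized l2 norm) for a contribution vector.

    Assumed definitions: a node is in the supporting set if it supplies more
    than a `delta` fraction of the total contribution; the l2 norm is taken
    after scaling the contributions to sum to 1.
    """
    total = sum(c.values())
    support = {u for u, x in c.items() if x > delta * total}
    l2 = sqrt(sum((x / total) ** 2 for x in c.values()))
    return support, l2

# Hypothetical toy graph: node -> list of nodes it links to.
graph = {
    "t": ["a"],
    "a": ["t"],
    "b": ["t"],
    "c": ["a"],
    "d": ["c"],
}

c = contributions(graph, "t")
support, l2 = support_features(c, delta=0.2)
print(sorted(support), round(l2, 3))
```

Under these assumptions, a larger normalized ℓ2 norm means the target's PageRank is concentrated on fewer contributors, and the supporting-set size counts how many nodes contribute a significant share. Whether either quantity separates spam from non-spam is an empirical question of the kind the paper's experiments address.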