Robust PageRank and locally computable spam detection features

Authors:
Reid Andersen;Christian Borgs;Jennifer Chayes;John Hopcroft;Kamal Jain;Vahab Mirrokni;Shanghua Teng
Affiliations:
Microsoft Research, Redmond;Microsoft Research, Redmond;Microsoft Research, Redmond;Cornell University;Microsoft Research, Redmond;Microsoft Research, Redmond;Boston University
Venue:
AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Year:
2008

Citing 11
Cited 10

The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Scaling personalized web search

WWW '03 Proceedings of the 12th international conference on World Wide Web
Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search

IEEE Transactions on Knowledge and Data Engineering
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Detecting spam web pages through content analysis

Proceedings of the 15th international conference on World Wide Web
To randomize or not to randomize: space optimal summaries for hyperlink analysis

Proceedings of the 15th international conference on World Wide Web
Link spam detection based on mass estimation

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Transductive link spam detection

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Combating web spam with trustrank

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Challenges in web search engines

IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Local computation of PageRank contributions

WAW'07 Proceedings of the 5th international conference on Algorithms and models for the web-graph

Looking into the past to better classify web spam

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Nullification test collections for web spam and SEO

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
A brief survey of computational approaches in social computing

IJCNN'09 Proceedings of the 2009 international joint conference on Neural Networks
False-name-proofness in social networks

WINE'10 Proceedings of the 6th international conference on Internet and network economics
Using patterns in the behavior of the random surfer to detect webspam beneficiaries

WISS'10 Proceedings of the 2010 international conference on Web information systems engineering
Webspam demotion: Low complexity node aggregation methods

Neurocomputing
The laplacian paradigm: emerging algorithms for massive graphs

TAMC'10 Proceedings of the 7th annual conference on Theory and Applications of Models of Computation
Reliability prediction of webpages in the medical domain

ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval
Querying provenance for ranking and recommending

TaPP'12 Proceedings of the 4th USENIX conference on Theory and Practice of Provenance
Detecting Webspam Beneficiaries Using Information Collected by the Random Surfer

International Journal of Organizational and Collective Intelligence

Abstract

Because the link structure of the web is an important element in search-engine ranking systems, web spammers widely manipulate that structure to increase the rank of their pages. Various link-based features of web pages have been introduced and have proven effective at identifying link spam. One particularly successful family of features (as described in the SpamRank algorithm) is based on examining the sets of pages that contribute most to the PageRank of a given vertex, called supporting sets. In a recent paper, the current authors described an algorithm for efficiently computing, for a single specified vertex, an approximation of its supporting sets. In this paper, we describe several link-based spam-detection features, both supervised and unsupervised, that can be derived from these approximate supporting sets. As unsupervised features, we examine the size of a node's supporting sets and the approximate ℓ2 norm of the PageRank contributions from other nodes. As a supervised feature, we examine the composition of a node's supporting sets. We perform experiments on two labeled real data sets to demonstrate the effectiveness of these features for spam detection, and show that the features can be computed efficiently. Furthermore, we design a variation of PageRank (called Robust PageRank) that incorporates some of these features into its ranking, argue that this variation is more robust against link spam engineering, and give an algorithm for approximating Robust PageRank.
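The two unsupervised features named in the abstract can be illustrated on a toy graph. The sketch below is not the authors' local approximation algorithm: it computes PageRank contributions exactly by iterating over the whole graph, and it assumes every node has at least one out-link. The function names, the teleport probability `alpha = 0.15`, the threshold `delta`, and the normalization by the target's PageRank are all illustrative assumptions rather than definitions taken from the paper.

```python
import math


def contributions_to(graph, target, alpha=0.15, iters=200):
    """Return {u: contribution of u to the PageRank of `target`}.

    Assumptions (not from the paper): `graph` maps each node to a
    non-empty list of out-neighbours, and the contribution of u is
    (alpha / n) * sum_t (1 - alpha)^t * P[t-step walk from u is at target].
    """
    nodes = list(graph)
    n = len(nodes)
    # x[u] = probability that a t-step random walk from u sits at target.
    x = {u: 1.0 if u == target else 0.0 for u in nodes}
    total = {u: 0.0 for u in nodes}
    weight = alpha / n
    for _ in range(iters):
        for u in nodes:
            total[u] += weight * x[u]
        # One more step: average the previous values over u's out-links.
        x = {u: sum(x[w] for w in graph[u]) / len(graph[u]) for u in nodes}
        weight *= 1.0 - alpha
    return total


def supporting_set_size(contrib, delta):
    """Count nodes contributing more than a `delta` fraction of the
    target's PageRank (hypothetical thresholding rule)."""
    pagerank = sum(contrib.values())
    return sum(1 for c in contrib.values() if c > delta * pagerank)


def l2_norm(contrib):
    """l2 norm of the contribution vector, normalized by the target's
    PageRank so that the value lies in (0, 1]."""
    pagerank = sum(contrib.values())
    return math.sqrt(sum((c / pagerank) ** 2 for c in contrib.values()))


# Toy graph: every node has at least one out-link; nobody links to "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

contrib_c = contributions_to(graph, "c")
contrib_d = contributions_to(graph, "d")
```

Under these assumptions, a normalized ℓ2 norm close to 1 means that a target's PageRank comes from very few contributors, while a small value means it is spread over many. How such values relate to spam is an empirical question that the paper's experiments address; the toy graph above makes no claim about it.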