DirichletRank: Solving the zero-one gap problem of PageRank

Authors:
Xuanhui Wang;Tao Tao;Jian-Tao Sun;Azadeh Shakery;Chengxiang Zhai
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL;University of Illinois at Urbana-Champaign, Urbana, IL;Microsoft Research Asia, Beijing, China;University of Illinois at Urbana-Champaign, Urbana, IL;University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2008

Abstract

Link-based ranking algorithms are among the most important techniques for improving web search. In particular, the PageRank algorithm has been used successfully in the Google search engine and has attracted much attention. However, we find that PageRank has a “zero-one gap” problem which, to the best of our knowledge, has not been addressed in any previous work. This problem can potentially be exploited to spam PageRank results and to render state-of-the-art link-based antispamming techniques ineffective. The zero-one gap problem arises from the current ad hoc way of computing transition probabilities in the random surfing model. We therefore propose a novel DirichletRank algorithm, which calculates these probabilities using Bayesian estimation with a Dirichlet prior. DirichletRank is a variant of PageRank, but it does not have the zero-one gap problem and can be shown analytically to be substantially more resistant to some forms of link spam than PageRank. Experimental results on TREC data show that DirichletRank achieves better retrieval accuracy than PageRank because of its more reasonable allocation of transition probabilities. More importantly, experiments on the TREC dataset and on another real web dataset from the Webgraph project show that, compared with the original PageRank, DirichletRank is more stable under link perturbation and is significantly more robust against both manually identified web spam and several kinds of simulated link spam. DirichletRank can be computed as efficiently as PageRank and is thus scalable to large-scale web applications.
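
The abstract does not state the formulas, so the following is only a minimal sketch of the idea under stated assumptions. It assumes the usual random surfing model, in which a page with no outlinks jumps to a random page with probability 1 while a page with one or more outlinks jumps with a small constant probability (0.15 here, a conventional choice), which is one way to read the "zero-one gap". For the Dirichlet variant it assumes a uniform prior over pages with a hypothetical strength parameter `mu`, which makes the jump probability `mu / (n + mu)` for a page with `n` outlinks. The function names and the value `mu=20.0` are illustrative and are not taken from the paper.

```python
def pagerank_jump_prob(n_out, jump=0.15):
    """Random-jump probability in the assumed PageRank model.

    A page with no outlinks always jumps; any other page jumps with
    the constant probability `jump`, regardless of its outlink count.
    """
    return 1.0 if n_out == 0 else jump


def dirichlet_jump_prob(n_out, mu=20.0):
    """Random-jump probability under an assumed uniform Dirichlet prior.

    `mu` is a hypothetical prior strength. The probability decreases
    smoothly from 1 as the number of outlinks grows.
    """
    return mu / (n_out + mu)


def transition_row(out_links, n_pages, jump_prob):
    """Transition probabilities from one page to every page."""
    row = [jump_prob / n_pages] * n_pages
    for target in out_links:
        # The remaining mass is split evenly over the page's outlinks.
        row[target] += (1.0 - jump_prob) / len(out_links)
    return row


def rank(graph, jump_fn, iters=100):
    """Power iteration over a graph given as a list of outlink lists."""
    n = len(graph)
    rows = [transition_row(links, n, jump_fn(len(links))) for links in graph]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [sum(scores[i] * rows[i][j] for i in range(n))
                  for j in range(n)]
    return scores


# Toy graph: page 0 links to 1, page 1 has no outlinks, page 2 links to 0 and 1.
toy_graph = [[1], [], [0, 1]]
pagerank_scores = rank(toy_graph, pagerank_jump_prob)
dirichlet_scores = rank(toy_graph, dirichlet_jump_prob)
```

In this sketch, adding a single outlink to a page drops its jump probability from 1 to 0.15 under the assumed PageRank model, but only from 1 to 20/21 under the Dirichlet variant with `mu=20.0`. Both rankings use the same power iteration, so the cost per iteration is the same.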