Focused crawling using navigational rank

Authors:
Shicong Feng;Li Zhang;Yuhong Xiong;Conglei Yao
Affiliations:
HP Labs China, Beijing, China;Microsoft Research Silicon Valley, Mountain View, CA, USA;Innovation Works, http://www.innovation-works.com, Beijing, China;HP Labs China, Beijing, China
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Abstract

The goal of focused crawling is to use limited resources to discover web pages related to a specific topic, rather than downloading all accessible web documents. The major challenge in focused crawling is to determine each hyperlink's capability of leading to target pages. To compute this capability, we present a novel approach called Navigational Rank (NR). NR is a two-step, two-direction credit propagation approach. Compared to existing methods, NR has three main advantages. First, NR is dynamically updated as the crawl progresses, so it adapts well to different website structures. Second, NR shows a significant performance advantage when the crawling seed is far from the target pages and the target pages constitute only a small portion of the whole website. Third, NR computes each link's capability of leading to target pages by considering both the target and non-target pages it leads to; this global knowledge yields better performance than using target pages alone. We have performed extensive experiments to evaluate the proposed approach using two groups of large-scale, real-world datasets from two different domains. The experimental results show that our approach is domain-independent and significantly outperforms the state of the art.
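
The abstract does not give NR's propagation rules, so the following is only a minimal sketch of the general idea it describes: link scores that are updated during the crawl from both the target and the non-target pages a link source has led to. It is not the NR algorithm. The function name `focused_crawl`, the toy in-memory `graph`, the `is_target` predicate, and the smoothed hit-ratio score are all assumptions made for illustration, and credit is passed back only one hop, to the page that contained the followed link.

```python
from collections import defaultdict


def focused_crawl(graph, seed, is_target, budget):
    """Crawl a toy in-memory link graph, preferring links from productive pages.

    graph:     dict mapping each page to the list of pages it links to
    is_target: function returning True for on-topic pages
    budget:    maximum number of pages to fetch
    Returns the pages in the order they were fetched.
    (Hypothetical illustration; not the NR algorithm from the paper.)
    """
    hits = defaultdict(int)    # target pages reached directly from a page
    misses = defaultdict(int)  # non-target pages reached directly from a page
    frontier = {seed: None}    # unfetched page -> page whose link led to it
    seen = {seed}
    order = []

    def score(page):
        source = frontier[page]
        # Smoothed fraction of the source's fetched links that were targets.
        return (hits[source] + 1) / (hits[source] + misses[source] + 2)

    while frontier and len(order) < budget:
        # Scores are recomputed at every step, so they use all evidence so far.
        # Ties go to the page that entered the frontier first.
        page = max(frontier, key=score)
        source = frontier.pop(page)
        order.append(page)

        # Credit or penalize the page whose link was followed.
        if source is not None:
            if is_target(page):
                hits[source] += 1
            else:
                misses[source] += 1

        # Add newly discovered links to the frontier.
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier[link] = page
    return order


# Toy site: only the pages under "products" are targets.
graph = {
    "seed": ["news", "products"],
    "news": ["n1", "n2", "n3"],
    "products": ["p1", "p2", "p3"],
}
targets = {"p1", "p2", "p3"}
print(focused_crawl(graph, "seed", targets.__contains__, budget=7))
# ['seed', 'news', 'n1', 'products', 'p1', 'p2', 'p3']
```

On this toy graph, a plain breadth-first crawl with the same budget of 7 would fetch only one target page, whereas the sketch abandons the unproductive "news" branch after a single miss and fetches all three. Counting misses as well as hits is what makes that switch possible, which is the intuition behind the third advantage claimed in the abstract.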