Scale-adaptable recrawl strategies for DHT-based distributed web crawling system

Authors:
Xiao Xu;Weizhe Zhang;Hongli Zhang;Binxing Fang
Affiliations:
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China (all authors)
Venue:
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Year:
2010

Abstract

A large-scale distributed Web crawling system built on voluntarily contributed personal computing resources allows small companies to build their own search engines at very low cost. The biggest challenge for such a system is implementing functionality equivalent to that of traditional search engines in a fluctuating distributed environment. One such function is incremental crawling, which requires recrawling each Web site according to how frequently its content is updated. However, recrawl intervals calculated solely from the change frequency of Web sites may not match the system's real-time capacity, which leads to inefficient use of resources. Building on our previous work on a DHT-based Web crawling system, in this paper we propose two scale-adaptable recrawl strategies that address this issue. The proposed methods are evaluated through simulations based on real Web datasets and show satisfactory results.
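
The mismatch between frequency-derived recrawl intervals and available capacity can be illustrated with a small sketch. This is not one of the two strategies proposed in the paper, which the abstract does not describe; it is a generic proportional rescaling chosen only to make the problem concrete. The `Site` type, the site names, the page counts, the intervals, and the capacity figure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    pages: int               # pages fetched per recrawl (hypothetical figure)
    ideal_interval_h: float  # recrawl interval from change frequency alone, in hours


def demand_pages_per_hour(sites):
    """Fetch rate the crawler would need to honour every ideal interval."""
    return sum(s.pages / s.ideal_interval_h for s in sites)


def scale_intervals(sites, capacity_pages_per_hour):
    """Multiply every interval by one common factor so demand equals capacity.

    A factor above 1 stretches intervals (the system is overloaded);
    a factor below 1 shrinks them (the system has idle capacity).
    """
    if capacity_pages_per_hour <= 0:
        raise ValueError("capacity must be positive")
    factor = demand_pages_per_hour(sites) / capacity_pages_per_hour
    return {s.name: s.ideal_interval_h * factor for s in sites}


# Hypothetical workload: ideal intervals demand 500 + 50 + 50 = 600 pages/hour.
sites = [
    Site("news.example", pages=1000, ideal_interval_h=2.0),
    Site("blog.example", pages=500, ideal_interval_h=10.0),
    Site("docs.example", pages=2000, ideal_interval_h=40.0),
]

# Assume the volunteer nodes currently sustain only 300 pages/hour,
# so every interval is doubled to fit.
adjusted = scale_intervals(sites, capacity_pages_per_hour=300)
```

With these assumed numbers the ideal intervals require twice the available capacity, so the uniform factor is 2. If capacity later rose to 1200 pages per hour, the same function would halve the intervals instead, which is the sense in which a recrawl schedule can adapt to the system's scale.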