A Forwarding-Based Task Scheduling Algorithm for Distributed Web Crawling over DHTs
ICPADS '09 Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems
A large-scale distributed Web crawling system built from voluntarily contributed personal computing resources allows small companies to build their own search engines at very low cost. The biggest challenge for such a system is implementing functionality equivalent to that of traditional search engines in a fluctuating distributed environment. One such functionality is incremental crawling, which requires recrawling each Web site at an interval matched to how frequently its content changes. However, recrawl intervals calculated solely from the change frequency of Web sites may mismatch the system's real-time capacity, leading to inefficient utilization of resources. Building on our previous work on a DHT-based Web crawling system, in this paper we propose two scale-adaptable recrawl strategies that address this issue. The proposed methods are evaluated through simulations on real Web datasets and show satisfactory results.
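The mismatch the abstract describes can be made concrete with a minimal sketch: compute naive recrawl intervals from each site's estimated change rate, then rescale them uniformly so the total implied crawl rate matches the system's current capacity. The function names and the uniform-scaling rule are illustrative assumptions, not the paper's actual algorithm.

```python
# Sketch of capacity-aware recrawl scheduling (assumed, simplified model).

def base_intervals(change_rates):
    """Naive intervals: recrawl each site once per expected change.
    change_rates: estimated changes per hour for each site."""
    return [1.0 / rate for rate in change_rates]

def scale_to_capacity(intervals, pages_per_hour):
    """Stretch (or shrink) all intervals by one factor so the implied
    total crawl rate equals what the volunteer pool can sustain."""
    demanded_rate = sum(1.0 / t for t in intervals)  # crawls/hour demanded
    factor = demanded_rate / pages_per_hour
    return [t * factor for t in intervals]

rates = [2.0, 0.5, 0.1]   # changes per hour per site
ivals = scale_to_capacity(base_intervals(rates), pages_per_hour=1.3)
# The scheduled crawl rate now equals capacity exactly:
total_rate = sum(1.0 / t for t in ivals)
assert abs(total_rate - 1.3) < 1e-9
```

Because every interval is scaled by the same factor, relative priorities between frequently and rarely changing sites are preserved while the aggregate load tracks the fluctuating capacity of the contributed machines.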