A Forwarding-Based Task Scheduling Algorithm for Distributed Web Crawling over DHTs

  • Authors:
  • Xiao Xu; Wei-Zhe Zhang; Hong-Li Zhang; Bin-Xing Fang; Xin-Ran Liu

  • Venue:
  • ICPADS '09: Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems
  • Year:
  • 2009

Abstract

Distributed Web crawling (DWC) over distributed hash tables (DHTs) has been proposed to remove the bottlenecks of traditional Web crawling. The core of such a system is its fully distributed task scheduling mechanism, in which crawlers are treated as peers and crawlees are treated as resources maintained by those peers. A system model based on the Content Addressable Network (CAN) can further optimize scheduling by exploiting the network proximity between crawlers and crawlees. In this paper, we propose a new forwarding-based task scheduling method for CAN that achieves load balancing in a CAN-based DWC system. In our simulations, the method keeps the load balanced among peers while also keeping the distance between peers and their resources short. This short peer-resource distance in turn meets the need for low crawler-crawlee latency.
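
The trade-off described in the abstract, balancing load while keeping each crawlee close to its crawler, can be illustrated with a toy sketch. This is not the paper's algorithm, whose details the abstract does not give. It assumes a hypothetical two-dimensional coordinate space standing in for network proximity, a hypothetical fixed per-peer `capacity`, and a global view of all peers, whereas a real CAN forwards only between zone neighbors. The names `Peer`, `distance`, and `schedule` are invented for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Peer:
    """A crawler, placed at a hypothetical network coordinate."""
    name: str
    pos: Tuple[float, float]
    capacity: int  # assumed limit on tasks before the peer forwards
    tasks: List[str] = field(default_factory=list)


def distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Euclidean distance, used here as a stand-in for network latency."""
    return math.dist(a, b)


def schedule(peers: List[Peer], task: str, task_pos: Tuple[float, float]) -> Peer:
    """Assign a crawlee to the nearest peer that still has spare capacity.

    Peers are tried from nearest to farthest, so a task that a full peer
    cannot accept is forwarded to the next-closest peer.
    """
    for peer in sorted(peers, key=lambda p: distance(p.pos, task_pos)):
        if len(peer.tasks) < peer.capacity:
            peer.tasks.append(task)
            return peer
    raise RuntimeError("all peers are at capacity")


# Usage: peer A is closest to both tasks but can hold only one of them.
peers = [Peer("A", (0.0, 0.0), 1), Peer("B", (1.0, 0.0), 2)]
print(schedule(peers, "site-1", (0.1, 0.0)).name)  # A
print(schedule(peers, "site-2", (0.2, 0.0)).name)  # B (A is full, so the task is forwarded)
```

Trying peers in distance order means a forwarded task lands on the closest peer that can still accept it, which is the toy counterpart of keeping peer-resource distance short under a load limit.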