Topic-oriented collaborative crawling

  • Authors:
  • Chiasen Chung;Charles L. A. Clarke

  • Affiliations:
  • University of Waterloo, Canada;University of Waterloo, Canada

  • Venue:
  • Proceedings of the eleventh international conference on Information and knowledge management
  • Year:
  • 2002

Quantified Score

Hi-index 0.01

Visualization

Abstract

A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented approach, in which the Web is partitioned into general subject areas with a crawler assigned to each. We examine design alternatives for a topic-oriented distributed crawler, including the creation of a Web page classifier for use in this context. The approach is compared experimentally with a hash-based partitioning, in which crawler assignments are determined by hash functions computed over URLs and page contents. The experimental evaluation demonstrates the feasibility of the approach, addressing issues of communication overhead, duplicate content detection, and page quality assessment.