On the design of a learning crawler for topical resource discovery

  • Authors:
  • Charu C. Aggarwal;Fatima Al-Garawi;Philip S. Yu

  • Affiliations:
  • IBM T. J. Watson Research Center, Yorktown Heights, NY;Columbia University, New York, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2001

Quantified Score

Hi-index 0.00

Visualization

Abstract

In recent years, the World Wide Web has shown enormous growth in size. Vast repositories of information are available on practically every possible topic. In such cases, it is valuable to perform topical resource discovery effectively. Consequently, several new ideas have been proposed in recent years; among them a key technique is focused crawling which is able to crawl particular topical portions of the World Wide Web quickly, without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling which actually learns characteristics of the linkage structure of the World Wide Web while performing the crawling. Specifically, the intelligent crawler uses the inlinking web page content, candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than the focused crawling technique which is based on a pre-defined understanding of the topical structure of the web. The techniques discussed in this paper are applicable for crawling web pages which satisfy arbitrary user-defined predicates such as topical queries, keyword queries, or any combinations of the above. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. We discuss how to intelligently select features which are most useful for a given crawl. The learning crawler is capable of reusing the knowledge gained in a given crawl in order to provide more efficient crawling for closely related predicates.