Exploiting the structure of the web for spidering

  • Authors: Thomas Dean; Joel D. Young
  • Affiliations: Brown University; Brown University
  • Year: 2005

Abstract

Published experiments on searching the Web suggest that, given training data in the form of a relatively small subgraph of the Web containing a subset of a selected class of target pages, a directed search can find additional target pages significantly faster (i.e., with fewer page retrievals) than a blind or uninformed search, whether random or systematic such as depth-first or breadth-first traversal. An agent performing such a task is termed a spider. Those experiments, however, were carried out in specialized domains or under conditions that are difficult to replicate. We present an experimental framework in which these basic questions can be reexamined and resolved so that results can be replicated and built upon. We provide high-performance tools for building experimental spiders and exploit the ground truth and static nature of the WT10g TREC Web Corpus. Within this framework we apply machine learning techniques to learn regressions on discounted reward and on navigational distance, and we use these regressions to guide spidering. Experimental results on the TREC 2001 Web ad hoc tasks show significant performance gains over blind and systematic search techniques.
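
The abstract describes a directed spider that ranks frontier links with a learned value estimate instead of expanding them in a fixed order. The abstract itself gives no code or formulas, so the sketch below is only a hypothetical illustration of that idea, assuming a learned scoring function `score(page, link)`, e.g., a regression predicting discounted reward (roughly V = Σ_t γ^t r_t, with r_t = 1 when the t-th subsequent retrieval is a target page) or a negated navigational distance to the nearest target. The helpers `fetch` and `extract_links` are placeholders, not APIs from the paper.

```python
import heapq
import itertools

def best_first_spider(start_url, fetch, extract_links, score, budget):
    """Sketch of directed (best-first) spidering.

    fetch(url)          -> page contents (placeholder)
    extract_links(page) -> iterable of outgoing URLs (placeholder)
    score(page, link)   -> estimated value of following `link` from `page`,
                           e.g. a learned regression on discounted reward
                           or negated navigational distance to targets
    budget              -> maximum number of page retrievals
    """
    counter = itertools.count()  # tie-breaker for equal scores
    frontier = [(0.0, next(counter), start_url)]
    visited = set()
    retrieved = []

    while frontier and len(retrieved) < budget:
        # heapq is a min-heap, so scores are stored negated:
        # the pop returns the link with the highest estimated value.
        _, _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        retrieved.append(url)
        for link in extract_links(page):
            if link not in visited:
                heapq.heappush(frontier, (-score(page, link), next(counter), link))
    return retrieved
```

Replacing the priority queue with a FIFO queue (or a stack) and a constant score recovers breadth-first (or depth-first) search, the blind, systematic baselines against which the learned spiders are compared.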