Crawling the infinite web

  • Authors: Ricardo Baeza-Yates; Carlos Castillo
  • Affiliations: Yahoo! Research and Center for Web Research, Dept. of Computer Science, University of Chile, Santiago, Chile (Baeza-Yates); Yahoo! Research and Universitat Pompeu Fabra, Catalunya, Spain (Castillo)
  • Venue: Journal of Web Engineering
  • Year: 2007

Abstract

Many publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. Web sites built with dynamic pages can, in principle, generate an extremely large number of Web pages. This poses a problem for the crawlers of Web search engines, as the network and storage resources required for indexing Web pages are neither infinite nor free. In this article, several probabilistic models of user browsing in "infinite" Web sites are proposed and studied. These models aim to predict how deep users go while exploring Web sites. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
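To make the depth-versus-coverage trade-off concrete, below is a minimal sketch in Python, assuming a simple memoryless browsing model: a user on a page at level l follows one more link with a fixed probability q. This is only one of the simplest models of the kind the abstract describes, not necessarily any of the refined variants studied and fitted in the article, and q here is an illustrative parameter. Under this assumption, pages at level l are viewed with probability q^l, so the expected fraction of page views within the first d levels is 1 - q^(d+1), and the crawl depth needed for a coverage target follows in closed form.

    import math

    def coverage(q: float, depth: int) -> float:
        """Expected fraction of page views within the first `depth`
        levels (start page = level 0), assuming a user follows one
        more link with probability q. Views at level l occur with
        probability q**l, so the cumulative fraction is
        1 - q**(depth + 1)."""
        return 1.0 - q ** (depth + 1)

    def depth_for_coverage(q: float, target: float = 0.90) -> int:
        """Smallest depth d with coverage(q, d) >= target.
        Solving 1 - q**(d + 1) >= target gives
        d >= log(1 - target) / log(q) - 1."""
        return max(0, math.ceil(math.log(1.0 - target) / math.log(q)) - 1)

    if __name__ == "__main__":
        # q values are illustrative "keep clicking" probabilities,
        # not estimates taken from the article.
        for q in (0.4, 0.5, 0.6, 0.65):
            d = depth_for_coverage(q, target=0.90)
            print(f"q={q:.2f}: 90% of expected page views lie within "
                  f"depth {d} (coverage there: {coverage(q, d):.3f})")

For continuation probabilities in this range the required depth stays between 2 and 5, consistent with the abstract's 3-to-5-clicks figure, and notably it does not depend on the size of the site. This is the kind of argument that lets a crawler safely truncate an "infinite" dynamically generated site after only a few levels.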