Crawling the infinite web

  • Authors: Ricardo Baeza-Yates; Carlos Castillo
  • Affiliations: Yahoo! Research and Center for Web Research, Dept. of Computer Science, University of Chile, Santiago, Chile (Baeza-Yates); Yahoo! Research and Universitat Pompeu Fabra, Catalunya, Spain (Castillo)
  • Venue: Journal of Web Engineering
  • Year: 2007

Abstract

Many publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. Web sites built with dynamic pages can, in principle, generate an extremely large number of Web pages. This poses a problem for the crawlers of Web search engines, as the network and storage resources required for indexing Web pages are neither infinite nor free. In this article, several probabilistic models of user browsing in "infinite" Web sites are proposed and studied. These models aim to predict how deep users go while exploring Web sites. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
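To make the depth-versus-coverage trade-off concrete, below is a minimal sketch in Python, assuming a simple memoryless browsing model: a user on a page at level l follows one more link with a fixed probability q. This is only one of the simplest models of the kind the abstract describes, not necessarily any of the refined variants studied and fitted in the article, and q here is an illustrative parameter. Under this assumption, pages at level l are viewed with probability q^l, so the expected fraction of page views within the first d levels is 1 - q^(d+1), and the crawl depth needed for a coverage target follows in closed form.

    import math

    def coverage(q: float, depth: int) -> float:
        """Expected fraction of page views within the first `depth`
        levels (start page = level 0), assuming a user follows one
        more link with probability q. Views at level l occur with
        probability q**l, so the cumulative fraction is
        1 - q**(depth + 1)."""
        return 1.0 - q ** (depth + 1)

    def depth_for_coverage(q: float, target: float = 0.90) -> int:
        """Smallest depth d with coverage(q, d) >= target.
        Solving 1 - q**(d + 1) >= target gives
        d >= log(1 - target) / log(q) - 1."""
        return max(0, math.ceil(math.log(1.0 - target) / math.log(q)) - 1)

    if __name__ == "__main__":
        # q values are illustrative "keep clicking" probabilities,
        # not estimates taken from the article.
        for q in (0.4, 0.5, 0.6, 0.65):
            d = depth_for_coverage(q, target=0.90)
            print(f"q={q:.2f}: 90% of expected page views lie within "
                  f"depth {d} (coverage there: {coverage(q, d):.3f})")

For continuation probabilities in this range the required depth stays between 2 and 5, consistent with the abstract's 3-to-5-clicks figure, and notably it does not depend on the size of the site. This is the kind of argument that lets a crawler safely truncate an "infinite" dynamically generated site after only a few levels.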