Many publicly available Web pages are generated dynamically upon request and contain links to other dynamically generated pages. Web sites built from dynamic pages can, in principle, generate a very large number of Web pages. This poses a problem for the crawlers of Web search engines, because the network and storage resources required to index Web pages are neither infinite nor free. In this article, we propose and study several probabilistic models of user browsing in "infinite" Web sites. These models aim to predict how deep users go while exploring a Web site. We use them to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real page-view data from several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
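To give a feel for the kind of estimate described above, the following is a minimal sketch. It is not one of the article's models. It assumes a deliberately simple geometric browsing model in which a user at any level follows a link one level deeper with a fixed probability q and otherwise leaves the site. The function names, the parameter q, and the use of page views as the coverage measure are all assumptions made for illustration.

```python
def coverage_within_depth(q: float, depth: int) -> float:
    """Fraction of page views that fall at levels 0..depth.

    Assumed model (hypothetical, for illustration only): every session
    starts at level 0 and goes one level deeper with probability q, so
    the expected number of views at level k is proportional to q**k.
    Summing the geometric series gives 1 - q**(depth + 1).
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - q ** (depth + 1)


def min_depth_for_coverage(q: float, target: float = 0.9,
                           max_depth: int = 1000) -> int:
    """Smallest crawl depth whose levels capture at least `target` of views."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    for depth in range(max_depth + 1):
        if coverage_within_depth(q, depth) >= target:
            return depth
    raise ValueError("target not reached within max_depth levels")


if __name__ == "__main__":
    # Illustrative values of q only; these are not measured data.
    for q in (0.3, 0.5, 0.6):
        print(q, min_depth_for_coverage(q))
```

Under this assumed model with q = 0.5, levels 0 through 3 capture 93.75% of page views, so a depth of 3 is the smallest that meets the 90% target. Any realistic estimate would need a value of q fitted to observed page-view data.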