Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, that measures the maximum potential NDCG achievable using a particular crawl. This allows us to evaluate different crawl policies and to investigate important scenarios such as selection stability over multiple iterations. We conduct two sets of crawling experiments, at the scale of 1 billion and 100 million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. Overall, PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank in a single crawl, but over multiple crawl iterations it is less effective and more unstable. Finally, we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank alone.
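As an illustration of the "maximum potential NDCG" idea, the following is a minimal sketch rather than the framework used in the experiments. It assumes that each query has a pool of judged URLs with graded relevance, that the best a search engine could do with a given crawl is to rank the crawled judged pages in descending order of grade, and that this is normalized by the ideal ranking over the whole pool. The gain function (2^grade - 1), the log2 rank discount, the cutoff k, and all function and variable names are assumptions made for this example.

```python
import math


def dcg(grades, k):
    """Discounted cumulative gain of a ranked list of relevance grades.

    Assumed variant: gain 2^grade - 1, discount log2(rank + 1).
    """
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def max_potential_ndcg(judgments, crawl, k=10):
    """Upper bound on NDCG@k for one query, given a crawl.

    judgments: dict mapping judged URL -> relevance grade (pooled judgments).
    crawl: set of URLs selected by the crawl policy.
    """
    # Ideal ranking over every judged page, crawled or not.
    ideal = sorted(judgments.values(), reverse=True)
    # Best achievable ranking using only judged pages present in the crawl.
    reachable = sorted(
        (g for url, g in judgments.items() if url in crawl), reverse=True
    )
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant page was judged for this query
    return dcg(reachable, k) / ideal_dcg


def mean_max_potential_ndcg(pooled, crawl, k=10):
    """Average the per-query bound over all queries in the judgment pool."""
    scores = [max_potential_ndcg(j, crawl, k) for j in pooled.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical judgments for two queries and two candidate crawls.
pooled = {
    "q1": {"a.example/1": 3, "b.example/2": 1, "c.example/3": 0},
    "q2": {"a.example/1": 2, "d.example/4": 2},
}
crawl_small = {"b.example/2", "d.example/4"}
crawl_full = {"a.example/1", "b.example/2", "c.example/3", "d.example/4"}

print(mean_max_potential_ndcg(pooled, crawl_small))
print(mean_max_potential_ndcg(pooled, crawl_full))
```

Under these assumptions, a crawl that contains every judged page scores 1.0, and a crawl that misses highly relevant pages scores lower regardless of how good the ranking function is, which is what makes the measure a property of the crawl rather than of the ranker.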