The impact of crawl policy on web search effectiveness

Authors:
Dennis Fetterly;Nick Craswell;Vishwa Vinay
Affiliations:
Microsoft Research, Mountain View, CA, USA;Microsoft Research, Cambridge, United Kingdom;Microsoft Research, Cambridge, United Kingdom
Venue:
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Year:
2009

Citing 17
Cited 8

Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Measuring index quality using random walks on the Web

WWW '99 Proceedings of the eighth international conference on World Wide Web
Cumulated gain-based evaluation of IR techniques

ACM Transactions on Information Systems (TOIS)
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Adaptive on-line page importance computation

WWW '03 Proceedings of the 12th international conference on World Wide Web
A large-scale study of the evolution of web pages

WWW '03 Proceedings of the 12th international conference on World Wide Web
Multi-Tier Architecture for Web Search Engines

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
What's new on the web?: the evolution of the web from a search engine perspective

Proceedings of the 13th international conference on World Wide Web
Sic transit gloria telae: towards an understanding of the web's decay

Proceedings of the 13th international conference on World Wide Web
Crawling a country: better strategies than breadth-first for web page ordering

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
The discoverability of the web

Proceedings of the 16th international conference on World Wide Web
Information re-retrieval: repeat queries in Yahoo's logs

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Hits on the web: how does it compare?

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
RankMass crawler: a crawler with high personalized pagerank coverage guarantee

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Crawl ordering by search impact

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
IRLbot: scaling to 6 billion pages and beyond

Proceedings of the 17th international conference on World Wide Web
Crawling the infinite web

Journal of Web Engineering

Development of a large-scale web crawler and search engine infrastructure

Proceedings of the 3rd International Universal Communication Symposium
Web Crawling

Foundations and Trends in Information Retrieval
The importance of anchor text for ad hoc search revisited

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Model characterization curves for federated search using click-logs: predicting user engagement metrics for the span of feasible operating points

Proceedings of the 20th international conference on World wide web
Discovering URLs through user feedback

Proceedings of the 20th ACM international conference on Information and knowledge management
User browsing behavior-driven web crawling

Proceedings of the 20th ACM international conference on Information and knowledge management
Incorporating social anchors for ad hoc retrieval

Proceedings of the 10th Conference on Open Research Areas in Information Retrieval
Timely crawling of high-quality ephemeral new content

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, measuring the maximum potential NDCG that is achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments at the scale of 1~billion and 100~million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and more unstable. Finally we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.