Measuring the Search Effectiveness of a Breadth-First Crawl

Authors:
Dennis Fetterly;Nick Craswell;Vishwa Vinay
Affiliations:
Microsoft Research Silicon Valley, Mountain View, USA;Microsoft Research Cambridge, Cambridge, UK;Microsoft Research Cambridge, Cambridge, UK
Venue:
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Year:
2009

Abstract

Previous scalability experiments found that early precision improves as collection size increases. However, that finding assumed that a collection's documents are all sampled with uniform probability from the same population. We contrast this with a large breadth-first web crawl, an important scenario in real-world Web search, where the early documents have quite different characteristics from the later documents. Having observed that NDCG@100 (measured over a set of reference queries) begins to plateau in the initial stages of the crawl, we investigate a number of possible reasons for this behaviour. These include the web pages themselves, the metric used to measure retrieval effectiveness, and the set of relevance judgements used.
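
As a rough illustration of the measurement the abstract describes, the sketch below computes NDCG@k for one query and then re-evaluates it on growing prefixes of a crawl. It is not the authors' code. The log2 discount, the treatment of unjudged documents as gain 0, and the shortcut of filtering a fixed ranking (rather than re-running retrieval on each prefix) are all assumptions made for illustration, and the names `dcg_at_k`, `ndcg_at_k`, and `ndcg_by_crawl_prefix` are hypothetical.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (log2 discount, assumed)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranking: List[str], judgments: Dict[str, float], k: int = 100) -> float:
    """NDCG@k for one query; unjudged documents count as gain 0 (an assumption)."""
    gains = [judgments.get(doc, 0.0) for doc in ranking]
    # The ideal ranking uses every judged document, whether or not it was crawled,
    # so scores stay comparable across crawl prefixes.
    ideal_dcg = dcg_at_k(sorted(judgments.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def ndcg_by_crawl_prefix(
    crawl_order: List[str],
    ranking: List[str],
    judgments: Dict[str, float],
    prefix_sizes: List[int],
    k: int = 100,
) -> Dict[int, float]:
    """NDCG@k when only the first n crawled documents are searchable, for each n.

    Simplification: a fixed ranking is filtered to the crawled prefix instead of
    re-running retrieval against each prefix.
    """
    position = {doc: i for i, doc in enumerate(crawl_order)}
    results = {}
    for n in prefix_sizes:
        # Documents never crawled get position n, so they are always excluded.
        visible = [doc for doc in ranking if position.get(doc, n) < n]
        results[n] = ndcg_at_k(visible, judgments, k)
    return results
```

Averaging the per-query values over a set of reference queries and plotting them against prefix size would give a curve of the kind the abstract refers to. Under these assumptions, a plateau means that later crawl prefixes add few documents that are both judged relevant and ranked in the top k.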