Effectiveness beyond the first crawl tier

Authors:
Rodrygo L.T. Santos;Craig Macdonald;Iadh Ounis
Affiliations:
University of Glasgow, Glasgow, United Kingdom;University of Glasgow, Glasgow, United Kingdom;University of Glasgow, Glasgow, United Kingdom
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Modern Web crawlers seek to visit quality documents first and to re-visit them more frequently than other documents. As a result, the first-tier crawl of a Web corpus is typically of higher quality than subsequent crawls. In this paper, we investigate the impact of first-tier documents on adhoc retrieval performance. In particular, we analyse the runs submitted to the adhoc task of the TREC 2009 Web track in terms of how they rank first-tier documents and how these documents contribute to the performance of each run. Our results show that the performance of these runs depends heavily on their ability to rank first-tier documents. Moreover, we show that, unlike leading Web search engines, these runs almost always lose performance when they attempt to go beyond the first tier. Finally, we show that selectively removing spam from different tiers can be a direction for fully exploiting documents beyond the first tier.
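The kind of per-run analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the toy document identifiers, and the choice of precision at a cutoff as the measure are all assumptions made for illustration. The sketch measures how much of a run's top ranks is taken up by first-tier documents, and how precision changes when the run is restricted to the first tier.

```python
from typing import List, Set


def first_tier_fraction(ranking: List[str], first_tier: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that belong to the first tier."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in first_tier) / len(top)


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Precision at cutoff k; ranks missing from a short list count as non-relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def restrict_to_first_tier(ranking: List[str], first_tier: Set[str]) -> List[str]:
    """Drop documents outside the first tier, preserving the original order."""
    return [doc for doc in ranking if doc in first_tier]


# Toy data, invented for illustration only (hypothetical document ids).
ranking = ["d1", "d7", "d2", "d9", "d3"]
first_tier = {"d1", "d2", "d3"}
relevant = {"d1", "d2", "d3"}

frac = first_tier_fraction(ranking, first_tier, k=4)
p_full = precision_at_k(ranking, relevant, k=3)
p_tier = precision_at_k(restrict_to_first_tier(ranking, first_tier), relevant, k=3)
```

In this toy case, restricting the run to first-tier documents raises precision at rank 3, because the two documents from beyond the first tier are not relevant. A real analysis would repeat the comparison per topic and per run using the track's relevance judgements.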