Crawl ordering by search impact

Authors:
Sandeep Pandey; Christopher Olston
Affiliations:
Carnegie Mellon University; Yahoo! Research
Venue:
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Year:
2008

Abstract

We study how to prioritize the fetching of new pages under the objective of maximizing the quality of search results. In particular, our objective is to fetch new pages that have the most impact, where the impact of a page is equal to the number of times the page appears in the top K search results for queries, for some constant K, e.g., K = 10. Since the impact of a page depends on its relevance score for queries, which in turn depends on the page content, the main difficulty lies in estimating the impact of the page before actually fetching it. Hence, impact must be estimated based on the limited information that is available prior to fetching page content, e.g., the URL string, number of in-links, and referring anchortext. We formally characterize this problem and study its hardness. We leverage our formalism to design a new impact-driven crawling policy, and demonstrate its effectiveness using real-world data. Our technique ensures that the crawler acquires content relevant to "tail topics" that are obscure but of interest to some users, rather than just redundantly accumulating content on popular topics.
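
The definition of impact above can be illustrated with a minimal Python sketch. It is not the paper's estimator: it assumes relevance scores are already known, which is exactly what a crawler lacks before fetching a page. The function name `page_impact`, the toy query log, the example URLs, and the scores are all hypothetical, and counting each query occurrence in the log once is an assumption about how "number of times" is tallied.

```python
from collections import Counter


def page_impact(query_log, scores, k=10):
    """Count, per page, how many query occurrences rank it in the top k.

    query_log: list of query strings, one entry per issued query (assumed).
    scores: dict mapping query -> {page: relevance score} (assumed known).
    """
    impact = Counter()
    for query in query_log:
        # Rank pages by descending score; break ties by page name.
        ranked = sorted(scores.get(query, {}).items(),
                        key=lambda kv: (-kv[1], kv[0]))
        for page, _ in ranked[:k]:
            impact[page] += 1
    return impact


# Toy workload: one popular query and one "tail" query.
query_log = ["jaguar", "jaguar", "jaguar", "rare orchid care"]
scores = {
    "jaguar": {"a.example/cars": 0.9, "b.example/cats": 0.8, "c.example/os": 0.4},
    "rare orchid care": {"d.example/orchids": 0.7, "b.example/cats": 0.1},
}

impact = page_impact(query_log, scores, k=2)
for page, count in impact.most_common():
    print(page, count)
```

In this toy workload the orchid page has nonzero impact only because of the tail query, while the third-ranked page for the popular query has none: under this objective, fetching yet another page on a popular topic adds nothing unless it reaches the top K.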