We study how to prioritize the fetching of new pages so as to maximize the quality of search results. Specifically, our objective is to fetch the new pages with the most impact, where the impact of a page is the number of times the page appears in the top K search results for queries, for some constant K, e.g., K = 10. Since the impact of a page depends on its relevance scores for queries, which in turn depend on the page content, the main difficulty lies in estimating the impact of a page before actually fetching it. Hence, impact must be estimated from the limited information that is available prior to fetching page content, e.g., the URL string, the number of in-links, and the referring anchortext. We formally characterize this problem and study its hardness. We then leverage our formalism to design a new impact-driven crawling policy, and demonstrate its effectiveness using real-world data. Our technique ensures that the crawler acquires content relevant to "tail topics", topics that are obscure but of interest to some users, rather than just redundantly accumulating content on popular topics.
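The two quantities in the abstract can be made concrete with a minimal sketch: computing the (a posteriori) impact of a page from top-K query results, and scoring uncrawled URLs by a pre-fetch impact estimate. The feature names (`inlinks`, `anchor_hits`) and the linear scoring weights are purely illustrative assumptions, not the paper's actual estimator.

```python
from collections import Counter

def page_impact(query_results, k=10):
    """Impact of a page = number of queries in whose top-k results it appears.

    query_results maps each query to its ranked result list (best first).
    """
    impact = Counter()
    for ranked_pages in query_results.values():
        for page in ranked_pages[:k]:
            impact[page] += 1
    return impact

def estimate_impact(features, w_inlinks=0.5, w_anchor=1.0):
    """Pre-fetch impact estimate from information available before download:
    in-link count and referring-anchortext hits (weights are hypothetical)."""
    return w_inlinks * features["inlinks"] + w_anchor * features["anchor_hits"]

def crawl_order(candidates):
    """Order uncrawled URLs by estimated impact, highest first."""
    return sorted(candidates, key=lambda url: estimate_impact(candidates[url]),
                  reverse=True)
```

An impact-driven crawler would fetch URLs in `crawl_order`, so that a page with strong anchortext evidence for a rare ("tail") query can outrank a heavily in-linked page on an already well-covered popular topic.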