In this paper, we study the problem of timely discovery and crawling of ephemeral new pages, i.e., pages whose user traffic grows very quickly right after they appear but lasts only a few days (e.g., news articles, blog and forum posts). Traditional crawling policies give no particular priority to such pages and may therefore crawl them too late, or even crawl content that is already obsolete. We thus propose a new metric, tailored to this task, that accounts for the decrease of user interest in ephemeral pages over time. We show that most ephemeral new pages can be found at a relatively small set of content sources and suggest a method for finding such a set. Our idea is to periodically recrawl content sources and crawl newly created pages linked from them, focusing on high-quality (in terms of user interest) content. One of the main difficulties here is to divide crawl resources between these two activities efficiently. We find an adaptive balance between crawls and recrawls by maximizing the proposed metric. Further, we incorporate search engine click logs to give our crawler insight into current user demands. The effectiveness of our approach is demonstrated experimentally on real-world data.
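The scheduling idea above — weighting each crawl action by the user interest that is still recoverable, and greedily dividing the budget between source recrawls and new-page crawls — can be sketched as follows. This is a minimal illustration, not the paper's actual metric or algorithm: the exponential-decay interest profile, the two-day half-life, and the `new_links_per_visit` source statistic are all assumptions introduced here for the example.

```python
import math

HALF_LIFE_DAYS = 2.0                       # assumption: interest halves every 2 days
DECAY = math.log(2) / HALF_LIFE_DAYS

def remaining_interest(age_days):
    # Fraction of a new page's total user interest that is still ahead of us
    # if we crawl it now, under an assumed exponential-decay interest profile.
    # Crawling immediately (age 0) captures everything; waiting loses value.
    return math.exp(-DECAY * age_days)

def next_action(sources, new_pages, now):
    """Greedy choice of the single most valuable crawl action:
    - recrawling a content source is worth (expected fresh links found there)
      times the interest of a just-appeared page;
    - crawling an already-discovered new page is worth its remaining interest
      given how long ago it appeared.
    `sources` are dicts with "url" and "new_links_per_visit"; `new_pages`
    are dicts with "url" and "appeared" (a timestamp in days)."""
    best = None
    for s in sources:
        gain = s["new_links_per_visit"] * remaining_interest(0.0)
        if best is None or gain > best[2]:
            best = ("recrawl", s["url"], gain)
    for p in new_pages:
        gain = remaining_interest(now - p["appeared"])
        if best is None or gain > best[2]:
            best = ("crawl", p["url"], gain)
    return best
```

For example, a source that typically yields three fresh links per visit (expected gain 3.0) outranks crawling a single day-old page (gain ≈ 0.71), so the sketch spends its next fetch on the recrawl; as discovered pages age, their gain decays and the balance shifts adaptively, which is the effect the maximized metric is meant to produce.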