In this paper, we study the problem of timely discovery and crawling of ephemeral new pages, i.e., pages whose user traffic grows very quickly right after they appear but lasts only a few days (e.g., news articles, blog and forum posts). Traditional crawling policies give no particular priority to such pages and may therefore crawl them too late, or even crawl content that is already obsolete. We thus propose a new metric, tailored to this task, that accounts for the decrease of user interest in ephemeral pages over time. We show that most ephemeral new pages can be found at a relatively small set of content sources and suggest a method for finding such a set. Our idea is to periodically recrawl content sources and crawl newly created pages linked from them, focusing on high-quality (in terms of user interest) content. One of the main difficulties here is to divide crawl resources between these two activities efficiently. We find an adaptive balance between crawls and recrawls by maximizing the proposed metric. Further, we incorporate search engine click logs to give our crawler insight into current user demands. The effectiveness of our approach is demonstrated experimentally on real-world data.
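The scheduling idea above — weighting each crawl action by the user interest that is still recoverable, and greedily dividing the budget between source recrawls and new-page crawls — can be sketched as follows. This is a minimal illustration, not the paper's actual metric or algorithm: the exponential-decay interest profile, the two-day half-life, and the `new_links_per_visit` source statistic are all assumptions introduced here for the example.

```python
import math

HALF_LIFE_DAYS = 2.0                       # assumption: interest halves every 2 days
DECAY = math.log(2) / HALF_LIFE_DAYS

def remaining_interest(age_days):
    # Fraction of a new page's total user interest that is still ahead of us
    # if we crawl it now, under an assumed exponential-decay interest profile.
    # Crawling immediately (age 0) captures everything; waiting loses value.
    return math.exp(-DECAY * age_days)

def next_action(sources, new_pages, now):
    """Greedy choice of the single most valuable crawl action:
    - recrawling a content source is worth (expected fresh links found there)
      times the interest of a just-appeared page;
    - crawling an already-discovered new page is worth its remaining interest
      given how long ago it appeared.
    `sources` are dicts with "url" and "new_links_per_visit"; `new_pages`
    are dicts with "url" and "appeared" (a timestamp in days)."""
    best = None
    for s in sources:
        gain = s["new_links_per_visit"] * remaining_interest(0.0)
        if best is None or gain > best[2]:
            best = ("recrawl", s["url"], gain)
    for p in new_pages:
        gain = remaining_interest(now - p["appeared"])
        if best is None or gain > best[2]:
            best = ("crawl", p["url"], gain)
    return best
```

For example, a source that typically yields three fresh links per visit (expected gain 3.0) outranks crawling a single day-old page (gain ≈ 0.71), so the sketch spends its next fetch on the recrawl; as discovered pages age, their gain decays and the balance shifts adaptively, which is the effect the maximized metric is meant to produce.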