Life, death, and lawfulness on the electronic frontier
Proceedings of the ACM SIGCHI Conference on Human factors in computing systems
Efficient crawling through URL ordering
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Trawling the Web for emerging cyber-communities
WWW '99 Proceedings of the eighth international conference on World Wide Web
On power-law relationships of the Internet topology
Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
Synchronizing a database to improve freshness
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
An adaptive model for optimizing performance of an incremental web crawler
Proceedings of the 10th international conference on World Wide Web
Approximation algorithms
Optimal crawling strategies for web search engines
Proceedings of the 11th international conference on World Wide Web
Computational Complexity of Machine Learning
Computational Complexity of Machine Learning
Finite-time Analysis of the Multiarmed Bandit Problem
Machine Learning
The Evolution of the Web and Implications for an Incremental Crawler
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Approximation algorithms for set cover and related problems
Approximation algorithms for set cover and related problems
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
What's new on the web?: the evolution of the web from a search engine perspective
Proceedings of the 13th international conference on World Wide Web
Proceedings of the 13th international conference on World Wide Web
A large-scale study of the evolution of web pages
Software—Practice & Experience - Special issue: Web technologies
WWW '05 Proceedings of the 14th international conference on World Wide Web
Rate of change and other metrics: a live study of the world wide web
USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems on USENIX Symposium on Internet Technologies and Systems
Crawl ordering by search impact
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Can social bookmarking improve web search?
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Local approximation of pagerank and reverse pagerank
Proceedings of the 17th ACM conference on Information and knowledge management
Retrievability: an evaluation measure for higher order information access tasks
Proceedings of the 17th ACM conference on Information and knowledge management
Measuring the Search Effectiveness of a Breadth-First Crawl
ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
The impact of crawl policy on web search effectiveness
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Automated ontology instantiation from tabular web sources-The AllRight system
Web Semantics: Science, Services and Agents on the World Wide Web
FICA: A novel intelligent crawling algorithm based on reinforcement learning
Web Intelligence and Agent Systems
Foundations and Trends in Information Retrieval
Proceedings of the 19th international conference on World wide web
Caching search engine results over incremental indices
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Discovering URLs through user feedback
Proceedings of the 20th ACM international conference on Information and knowledge management
Scalable manipulation of archival web graphs
Proceedings of the 9th workshop on Large-scale and distributed informational retrieval
Crawling Ajax-Based Web Applications through Dynamic Analysis of User Interface State Changes
ACM Transactions on the Web (TWEB)
A novel crawling algorithm for web pages
AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Hidden-Web induced by client-side scripting: an empirical study
ICWE'13 Proceedings of the 13th international conference on Web Engineering
Hi-index | 0.00 |
Previous studies have highlighted the high arrival rate of new contenton the web. We study the extent to which this new content can beefficiently discovered by a crawler. Our study has two parts. First,we study the inherent difficulty of the discovery problem using amaximum cover formulation, under an assumption of perfect estimates oflikely sources of links to new content. Second, we relax thisassumption and study a more realistic setting in which algorithms mustuse historical statistics to estimate which pages are most likely toyield links to new content. We recommend a simple algorithm thatperforms comparably to all approaches we consider.We measure the emphoverhead of discovering new content, defined asthe average number of fetches required to discover one new page. Weshow first that with perfect foreknowledge of where to explore forlinks to new content, it is possible to discover 90% of all newcontent with under 3% overhead, and 100% of new content with 9%overhead. But actual algorithms, which do not have access to perfectforeknowledge, face a more difficult task: one quarter of new contentis simply not amenable to efficient discovery. Of the remaining threequarters, 80% of new content during a given week may be discoveredwith 160% overhead if content is recrawled fully on a monthly basis.