The discoverability of the web

Authors:
Anirban Dasgupta;Arpita Ghosh;Ravi Kumar;Christopher Olston;Sandeep Pandey;Andrew Tomkins
Affiliations:
Yahoo! Research, Sunnyvale, CA;Yahoo! Research, Sunnyvale, CA;Yahoo! Research, Sunnyvale, CA;Yahoo! Research, Sunnyvale, CA;Yahoo! Research, Sunnyvale, CA;Yahoo! Research, Sunnyvale, CA
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007


Abstract

Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider.

We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.
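
To make the maximum cover formulation and the overhead measure concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes perfect knowledge of which known pages link to which new pages, and it uses the standard greedy heuristic for maximum cover, which the abstract does not name. The function names `greedy_cover` and `overhead`, and the toy link map, are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_cover(links: Dict[str, Set[str]], budget: int) -> Tuple[List[str], Set[str]]:
    """Choose up to `budget` known pages to fetch.

    `links` maps each known page to the set of new pages it links to
    (assumed to be known exactly, i.e. perfect foreknowledge).
    Each round picks the page that reveals the most undiscovered new pages.
    """
    chosen: List[str] = []
    discovered: Set[str] = set()
    for _ in range(budget):
        # Iterate in sorted order so ties are broken deterministically.
        best = max(
            (p for p in sorted(links) if p not in chosen),
            key=lambda p: len(links[p] - discovered),
            default=None,
        )
        # Stop early when no remaining page reveals anything new.
        if best is None or not (links[best] - discovered):
            break
        chosen.append(best)
        discovered |= links[best]
    return chosen, discovered


def overhead(fetches: int, new_pages: int) -> float:
    """Average number of fetches per new page discovered, as a ratio."""
    if new_pages <= 0:
        raise ValueError("overhead is undefined when no new pages are discovered")
    return fetches / new_pages


# Hypothetical toy data: three known pages linking to four new pages.
toy_links = {
    "a": {"n1", "n2", "n3"},
    "b": {"n3", "n4"},
    "c": {"n1"},
}
chosen, discovered = greedy_cover(toy_links, budget=3)
ratio = overhead(len(chosen), len(discovered))
```

On this toy data the sketch fetches two known pages to discover four new ones, a ratio of 0.5. Read as a percentage, that corresponds to 50% overhead in the abstract's terms, far higher than the figures reported for the real web, because the toy graph is tiny.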