Crawling algorithms have been the subject of extensive research and optimization, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover "most" of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing, in the context of a system that is given a set of trusted pages, a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the "important" part of the Web they will download after crawling a certain number of pages and (2) give high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective at downloading important pages early on and provides high "coverage" of the Web with a relatively small number of pages.
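As a rough illustration of the kind of crawler the abstract describes, the sketch below implements a best-first crawl that prioritizes URLs by random-walk probability mass propagated from a set of trusted seed pages, and reports the total mass of the downloaded pages as a coverage estimate. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the damping factor, the `fetch_links` callback, and the exact propagation rule are hypothetical choices made for illustration.

```python
import heapq

# Minimal best-first crawler sketch (assumptions: standard PageRank-style
# damping, equal initial mass on trusted seeds, fetch_links() as a
# hypothetical stand-in for downloading a page and extracting its outlinks).
DAMPING = 0.85

def crawl(trusted_seeds, fetch_links, budget):
    """Download up to `budget` pages, highest estimated importance first.

    Returns the set of downloaded URLs and the total probability mass they
    cover, used here as a heuristic estimate of how much of the "important"
    part of the Web has been downloaded.
    """
    # Each trusted seed starts with an equal share of the random-surfer mass.
    mass = {url: 1.0 / len(trusted_seeds) for url in trusted_seeds}
    # Max-heap implemented by negating priorities.
    frontier = [(-m, url) for url, m in mass.items()]
    heapq.heapify(frontier)
    downloaded, covered = set(), 0.0

    while frontier and len(downloaded) < budget:
        _, url = heapq.heappop(frontier)
        if url in downloaded:
            continue  # stale queue entry for an already-downloaded page
        downloaded.add(url)
        covered += mass[url]

        # Propagate a damped share of this page's mass to its outlinks,
        # raising their priority in the frontier.
        links = fetch_links(url)
        if not links:
            continue
        share = DAMPING * mass[url] / len(links)
        for link in links:
            if link in downloaded:
                continue
            mass[link] = mass.get(link, 0.0) + share
            heapq.heappush(frontier, (-mass[link], link))

    return downloaded, covered

# Example usage on a toy in-memory link graph (hypothetical data):
if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
    pages, coverage = crawl(["a"], lambda u: graph.get(u, []), budget=3)
    print(pages, round(coverage, 3))
```

The design choice worth noting is that the priority of a page and the stopping criterion come from the same quantity: the mass accumulated on downloaded pages grows monotonically, so an operator could stop once it exceeds a chosen threshold, in the spirit of the coverage guarantee the abstract describes.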