Crawling algorithms have been the subject of extensive research and optimization, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover "most" of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing, in the context of a system that is given a set of trusted pages, a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the "important" part of the Web they will download after crawling a certain number of pages and (2) give high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective at downloading important pages early on and provides high "coverage" of the Web with a relatively small number of pages.
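As a rough illustration of the kind of crawler the abstract describes, the sketch below implements a best-first crawl that prioritizes URLs by random-walk probability mass propagated from a set of trusted seed pages, and reports the total mass of the downloaded pages as a coverage estimate. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the damping factor, the `fetch_links` callback, and the exact propagation rule are hypothetical choices made for illustration.

```python
import heapq

# Minimal best-first crawler sketch (assumptions: standard PageRank-style
# damping, equal initial mass on trusted seeds, fetch_links() as a
# hypothetical stand-in for downloading a page and extracting its outlinks).
DAMPING = 0.85

def crawl(trusted_seeds, fetch_links, budget):
    """Download up to `budget` pages, highest estimated importance first.

    Returns the set of downloaded URLs and the total probability mass they
    cover, used here as a heuristic estimate of how much of the "important"
    part of the Web has been downloaded.
    """
    # Each trusted seed starts with an equal share of the random-surfer mass.
    mass = {url: 1.0 / len(trusted_seeds) for url in trusted_seeds}
    # Max-heap implemented by negating priorities.
    frontier = [(-m, url) for url, m in mass.items()]
    heapq.heapify(frontier)
    downloaded, covered = set(), 0.0

    while frontier and len(downloaded) < budget:
        _, url = heapq.heappop(frontier)
        if url in downloaded:
            continue  # stale queue entry for an already-downloaded page
        downloaded.add(url)
        covered += mass[url]

        # Propagate a damped share of this page's mass to its outlinks,
        # raising their priority in the frontier.
        links = fetch_links(url)
        if not links:
            continue
        share = DAMPING * mass[url] / len(links)
        for link in links:
            if link in downloaded:
                continue
            mass[link] = mass.get(link, 0.0) + share
            heapq.heappush(frontier, (-mass[link], link))

    return downloaded, covered

# Example usage on a toy in-memory link graph (hypothetical data):
if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
    pages, coverage = crawl(["a"], lambda u: graph.get(u, []), budget=3)
    print(pages, round(coverage, 3))
```

The design choice worth noting is that the priority of a page and the stopping criterion come from the same quantity: the mass accumulated on downloaded pages grows monotonically, so an operator could stop once it exceeds a chosen threshold, in the spirit of the coverage guarantee the abstract describes.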