Graph-based seed selection for web-scale crawlers

Authors:
Shuyi Zheng;Pavel Dmitriev;C. Lee Giles
Affiliations:
Pennsylvania State University, University Park, PA, USA;Yahoo! Labs, Santa Clara, CA, USA;Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

One of the most important steps in web crawling is determining the starting points, or seed selection. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is a nontrivial and very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a repository with more "good" and fewer "bad" pages. We propose a graph-based framework for crawler seed selection, and present several algorithms within this framework. Evaluation on real web data showed significant improvements over heuristic seed selection approaches.