Graph based crawler seed selection

Authors:
Shuyi Zheng; Pavel Dmitriev; C. Lee Giles
Affiliations:
Pennsylvania State University, University Park, PA, USA; Yahoo! Labs, Santa Clara, CA, USA; Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 18th International Conference on World Wide Web
Year:
2009

Abstract

This paper identifies and explores the problem of seed selection in a web-scale crawler. We argue that seed selection is a nontrivial and very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a collection with more "good" and fewer "bad" pages. Based on an analysis of the graph structure of the web, we propose several seed selection algorithms. The effectiveness of these algorithms is demonstrated by our experimental results on real web data.