One of the most important steps in web crawling is seed selection: determining the starting points of the crawl. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is both non-trivial and important. Selecting proper seeds can increase the number of pages a crawler discovers and can yield a repository with more "good" pages and fewer "bad" ones. We propose a graph-based framework for crawler seed selection and present several algorithms within this framework. An evaluation on real web data showed significant improvements over heuristic seed selection approaches.
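To make the idea of graph-based seed selection concrete, the following is a minimal sketch, not one of the paper's algorithms. It assumes the web is given as a small adjacency map, that a crawl from a seed discovers every page within a fixed number of hops, and that seeds are chosen greedily by how many not-yet-covered pages each one adds. The names `reachable`, `greedy_seeds`, `max_depth`, and `toy_graph` are hypothetical and exist only for this illustration.

```python
from collections import deque


def reachable(graph, seed, max_depth):
    """Pages a breadth-first crawl discovers from `seed` within `max_depth` hops."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the hop limit
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def greedy_seeds(graph, k, max_depth=2):
    """Pick up to k seeds, each time adding the one that covers the most new pages."""
    coverage = {page: reachable(graph, page, max_depth) for page in graph}
    covered, seeds = set(), []
    for _ in range(k):
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(coverage), key=lambda p: len(coverage[p] - covered))
        if not coverage[best] - covered:
            break  # no remaining candidate adds any new page
        seeds.append(best)
        covered |= coverage[best]
    return seeds, covered


# Hypothetical toy link graph: page -> pages it links to
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["e"],
    "e": ["f"],
    "f": [],
    "g": ["a"],
}

seeds, covered = greedy_seeds(toy_graph, k=2)
print(seeds, sorted(covered))
```

On this toy graph the two chosen seeds together cover all seven pages, whereas a poor choice such as "c" and "f" would cover only two. A real framework would also need to weigh page quality ("good" versus "bad" pages), which this sketch leaves out.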