The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists

  • Authors:
  • Jian Wu; Pradeep Teregowda; Juan Pablo Fernández Ramírez; Prasenjit Mitra; Shuyi Zheng; C. Lee Giles

  • Affiliations:
  • Pennsylvania State University, PA (Jian Wu, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Prasenjit Mitra, C. Lee Giles); Facebook Inc., Menlo Park, CA (Shuyi Zheng)

  • Venue:
  • Proceedings of the 4th Annual ACM Web Science Conference
  • Year:
  • 2012


Abstract

We present a preliminary study of the evolution of a crawling strategy for an academic document search engine, in particular CiteSeerX. CiteSeerX actively crawls the web for academic and research documents, primarily in computer and information sciences, and then performs information extraction and indexing, extracting information such as OAI metadata, citations, and tables. As such, CiteSeerX can be considered a specialty or vertical search engine. To improve crawl precision and reduce wasted resources, we replace a blacklist with a whitelist and compare crawling efficiency before and after this change. With a blacklist, the crawler is forbidden from a certain list of URLs, such as publisher domains, but is otherwise unrestricted. With a whitelist, only certain domains are crawled and all others are ignored. The whitelist is generated from domain ranking scores of approximately five million parent URLs harvested by the CiteSeerX crawler over the past four years. We calculate an F1 score for each domain by applying equal weights to document counts and citation rates. The whitelist is then generated by re-ordering parent URLs based on their domain ranking scores. We find that crawling with the whitelist significantly increases crawl precision by eliminating a large number of irrelevant requests and downloads.
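
The abstract does not give the scoring formula beyond "equal weights to document counts and citation rates"; a natural reading is the harmonic mean (F1) of the two quantities after normalization. The Python sketch below illustrates that interpretation only. The max-based normalization, the tuple layout, and the `rank_domains` helper are assumptions for illustration, not the authors' implementation.

```python
def f1(a, b):
    """Harmonic mean of two non-negative scores (equal weights)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def rank_domains(domain_stats):
    """domain_stats: iterable of (domain, doc_count, citation_rate) tuples.

    Returns (domain, score) pairs sorted by descending F1-style score,
    where doc counts and citation rates are first scaled to [0, 1] by
    their maxima (this normalization is an assumption).
    """
    rows = list(domain_stats)
    max_docs = max((r[1] for r in rows), default=1) or 1
    max_cite = max((r[2] for r in rows), default=1) or 1
    scored = [
        (domain, f1(docs / max_docs, cites / max_cite))
        for domain, docs, cites in rows
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical usage: keep positively scored domains as the crawl whitelist.
stats = [
    ("arxiv.org", 120_000, 8.4),
    ("example.edu", 3_200, 2.1),
    ("spam.example.com", 50, 0.0),
]
whitelist = [domain for domain, score in rank_domains(stats) if score > 0]
```

Under this reading, a domain must deliver both a substantial document yield and a non-trivial citation rate to rank highly, which is consistent with the stated goal of cutting irrelevant requests and downloads.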