Web-crawling reliability

Authors:
Viv Cothey
Affiliations:
School of Computing and Information Technology, University of Wolverhampton, Lichfield Street, Wolverhampton, United Kingdom, WV1 1SB
Venue:
Journal of the American Society for Information Science and Technology - Special issue: Webometrics
Year:
2004

Abstract

In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling. The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling. It is shown that Web crawling by search engines is intentionally biased and selective. I also report the results of a large-scale experimental simulation of Web crawling that illustrates the effects of different crawling policies on data collection. It is concluded that the reliability of Web crawling as a data collection technique is improved by fuller reporting of relevant crawling policies.