A larger scale study of robots.txt
Proceedings of the 17th international conference on World Wide Web
Search engines rely largely on Web robots to collect information from the Web. Because of the open, unregulated nature of the Web, robot activities are extremely diverse. Such crawling activity can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although the protocol is not an enforced standard, ethical robots (including many commercial ones) follow the rules specified in robots.txt. Using our focused crawler, we investigate 7,593 websites from the education, government, news, and business domains. Five crawls were conducted in succession to study temporal changes. Through statistical analysis of the data, we present a survey of the usage of Web robot rules at Web scale. The results also show that the use of robots.txt has increased over time.
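As a minimal sketch of the server-side regulation the abstract describes, the snippet below shows how an ethical crawler would consult Robots Exclusion Protocol rules before fetching a page, using Python's standard-library `urllib.robotparser`. The rules and URLs are illustrative assumptions, not data from the study.

```python
# Sketch: an ethical robot checking illustrative robots.txt rules
# before crawling (hypothetical site and paths, for demonstration only).
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# An ethical robot skips disallowed paths and honors the crawl delay.
print(parser.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("MyBot", "https://example.com/public/index.html"))  # True
print(parser.crawl_delay("MyBot"))  # 10
```

Because the protocol is advisory rather than enforced, nothing prevents a robot from ignoring these rules; compliance is exactly what distinguishes the ethical crawlers the study measures.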