A larger scale study of robots.txt
Proceedings of the 17th international conference on World Wide Web
Search engines rely largely on Web robots to collect information from the Web. Because of the open, unregulated nature of the Web, robot activities are extremely diverse. Such crawling activity can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although the protocol is not an enforced standard, ethical robots (including many commercial ones) follow the rules specified in robots.txt. Using our focused crawler, we investigate 7,593 websites from the education, government, news, and business domains. Five crawls were conducted in succession to study temporal changes. Through statistical analysis of the data, we present a survey of the usage of Web robot rules at Web scale. The results also show that the use of robots.txt has increased over time.
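As a minimal sketch of the server-side regulation the abstract describes, the snippet below shows how an ethical crawler would consult Robots Exclusion Protocol rules before fetching a page, using Python's standard-library `urllib.robotparser`. The rules and URLs are illustrative assumptions, not data from the study.

```python
# Sketch: an ethical robot checking illustrative robots.txt rules
# before crawling (hypothetical site and paths, for demonstration only).
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# An ethical robot skips disallowed paths and honors the crawl delay.
print(parser.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("MyBot", "https://example.com/public/index.html"))  # True
print(parser.crawl_delay("MyBot"))  # 10
```

Because the protocol is advisory rather than enforced, nothing prevents a robot from ignoring these rules; compliance is exactly what distinguishes the ethical crawlers the study measures.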