Web robot detection in the scholarly information environment

Authors:
Paul Huntington;David Nicholas;Hamid R. Jamali
Affiliations:
School of Library, Archive and Information Studies, University College London;School of Library, Archive and Information Studies, University College London;School of Library, Archive and Information Studies, University College London
Venue:
Journal of Information Science
Year:
2008

Abstract

An increasing number of robots harvest information on the World Wide Web for a wide variety of purposes. Protocols developed at the inception of the web laid out voluntary procedures for identifying robot behaviour and excluding it if necessary. Few robots now follow these protocols, and it is increasingly difficult to filter out robot activity in reports of on-site activity. This paper demonstrates the issues involved in identifying robots and assessing their impact on usage, drawing on a project that sought to establish the relative usage patterns of open access and non-open access articles in Glycobiology, a journal published by Oxford University Press that offers articles in both forms within a single issue. A number of methods for identifying robots are compared; together, these methods found that 40% of the raw logs of this journal could be attributed to robots.
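
To make the log-filtering problem concrete, the following is a minimal sketch of two widely used heuristics: flagging any host that requests `/robots.txt`, and flagging any host whose user-agent string contains a crawler keyword. The abstract does not say which methods the paper compares, so this is an illustration and not the authors' procedure. The log lines, host addresses, paths, agent names, and the `AGENT_HINTS` keyword list are all invented for the example, and the Combined Log Format is an assumption about how such logs might be stored.

```python
import re

# Hypothetical Combined Log Format entries (invented hosts, paths, agents).
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jan/2007:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [01/Jan/2007:00:00:02 +0000] "GET /article/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [01/Jan/2007:00:01:10 +0000] "GET /article/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [01/Jan/2007:00:03:40 +0000] "GET /article/2 HTTP/1.1" 200 4096 "-" "Mozilla/5.0"',
    '10.0.0.3 - - [01/Jan/2007:00:04:00 +0000] "GET /article/2 HTTP/1.1" 200 4096 "-" "ExampleSpider/2.0"',
]

# Captures the client host, the requested path, and the user-agent string.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

# Assumed keyword list; real crawlers may use none of these.
AGENT_HINTS = ("bot", "crawler", "spider")


def robot_hosts(lines):
    """Return hosts that fetched /robots.txt or declared a crawler-like agent."""
    hosts = set()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        if match.group("path") == "/robots.txt" or any(h in agent for h in AGENT_HINTS):
            hosts.add(match.group("host"))
    return hosts


def robot_share(lines):
    """Fraction of parsed requests that come from flagged hosts."""
    parsed = [m for m in map(LINE_RE.match, lines) if m is not None]
    if not parsed:
        return 0.0
    flagged = robot_hosts(lines)
    hits = sum(1 for m in parsed if m.group("host") in flagged)
    return hits / len(parsed)
```

On the invented sample, `robot_hosts(SAMPLE_LOG)` flags `10.0.0.1` (which fetched `/robots.txt` while presenting a browser-like agent) and `10.0.0.3` (which declares a spider agent), so `robot_share(SAMPLE_LOG)` attributes three of the five requests to robots. A robot that neither fetches `/robots.txt` nor identifies itself would pass through both checks unflagged, which is the difficulty the abstract describes.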