Catching web crawlers in the act

  • Authors:
  • Anália G. Lourenço; Orlando O. Belo

  • Affiliations:
  • Universidade do Minho, Braga, Portugal (both authors)

  • Venue:
  • ICWE '06: Proceedings of the 6th International Conference on Web Engineering
  • Year:
  • 2006

Abstract

This paper proposes a new approach to the detection and containment of Web crawler traverses based on clickstream data mining. Timely detection prevents abusive consumption of Web server resources by crawlers and potential violations of site content privacy or copyright. Differentiating clickstream data ensures focused usage analysis, valuable for profiling both regular users and crawlers. Our platform, named ClickTips, sustains a site-specific, updatable detection model that tags Web crawler traverses based on incremental Web session inspection, and a decision model that assesses eventual containment. The goal is to deliver a model flexible enough to keep up with the continuous evolution of crawling and capable of detecting crawler presence as soon as possible. We use a real-world Web site case study to support the process description, as well as to evaluate the accuracy of the obtained classification models and their ability to discover previously unknown Web crawlers.
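
The abstract does not detail the session features that ClickTips derives from clickstream data, so the sketch below is purely illustrative. It assumes a few heuristics commonly associated with crawler behaviour (a request for robots.txt, a low share of embedded resources, highly regular inter-request intervals); the names `Request` and `tag_session` are hypothetical and not part of the paper.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import List


@dataclass
class Request:
    """A single clickstream entry within one Web session."""
    timestamp: float   # seconds since session start
    path: str          # requested URL path


def tag_session(requests: List[Request]) -> str:
    """Tag a session as 'crawler' or 'human' from simple heuristic features.

    Hypothetical features, loosely inspired by the clickstream-profiling idea:
    - a request for /robots.txt (typical of well-behaved crawlers),
    - a low ratio of embedded resources (images/CSS/JS) to page requests,
    - near-constant gaps between successive requests.
    """
    paths = [r.path.lower() for r in requests]

    asked_robots = any(p.endswith("/robots.txt") for p in paths)

    embedded = sum(p.endswith((".png", ".jpg", ".gif", ".css", ".js")) for p in paths)
    embedded_ratio = embedded / len(paths) if paths else 0.0

    intervals = [b.timestamp - a.timestamp for a, b in zip(requests, requests[1:])]
    regular_timing = len(intervals) >= 3 and pstdev(intervals) < 0.5

    # A session matching at least two of the three cues is tagged as a crawler.
    score = sum([asked_robots, embedded_ratio < 0.1, regular_timing])
    return "crawler" if score >= 2 else "human"


if __name__ == "__main__":
    session = [Request(t, p) for t, p in [
        (0.0, "/robots.txt"), (2.0, "/index.html"),
        (4.0, "/products.html"), (6.0, "/contact.html"),
    ]]
    print(tag_session(session))  # -> crawler
```

In the paper's setting such tagging would run incrementally as a session unfolds, so that a traverse can be flagged (and possibly contained) before the crawler finishes visiting the site; the fixed thresholds above stand in for the site-specific, updatable classification model that ClickTips learns from its own logs.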