An investigation of web crawler behavior: characterization and metrics

Authors:
Marios D. Dikaiakos;Athena Stassopoulou;Loizos Papageorgiou
Affiliations:
Department of Computer Science, University of Cyprus, P.O. Box 20537, Kallipoleos 75, Nicosia 1678, Cyprus;Department of Computer Science, Intercollege, P.O. Box 24005, Nicosia, Cyprus;Department of Computer Science, University of Cyprus, P.O. Box 20537, Kallipoleos 75, Nicosia 1678, Cyprus
Venue:
Computer Communications
Year:
2005

Abstract

In this paper, we present a characterization study of search-engine crawlers. For the purposes of our work, we use Web-server access logs from five academic sites in three different countries. Based on these logs, we analyze the activity of different crawlers that belong to five search engines: Google, AltaVista, Inktomi, FastSearch and CiteSeer. We compare crawler behavior to the characteristics of general World-Wide Web traffic and to the findings of general characterization studies. We analyze crawler requests to derive insights into the behavior and strategy of crawlers. We propose a set of simple metrics that describe qualitative characteristics of crawler behavior, namely a crawler's preference for resources of a particular format, the frequency of its visits to a Web site, and the pervasiveness of its visits to a particular site. To the best of our knowledge, this is the first extensive and in-depth characterization of search-engine crawlers. Our results and observations provide useful insights into crawler behavior and serve as a basis for our ongoing work on the automatic detection of Web crawlers.
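
The abstract names three kinds of metrics (format preference, visit frequency, pervasiveness) but does not define them. The following is only a minimal sketch of how quantities of this kind might be computed from an access log; it is not the authors' method. It assumes logs in the Apache Combined Log Format and identifies a crawler by a substring of its User-Agent header. The function name `crawler_metrics`, the three metric definitions (share of requests per file extension, number of distinct days with at least one request, fraction of known site paths requested) and the sample log lines are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Assumed input: Apache Combined Log Format, e.g.
# host ident user [day:time zone] "METHOD path PROTO" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def crawler_metrics(lines, agent_substring, site_paths):
    """Hypothetical metrics for one crawler, matched by User-Agent substring."""
    formats = Counter()   # requests per file extension
    days = set()          # distinct days on which the crawler was seen
    visited = set()       # distinct paths the crawler requested
    needle = agent_substring.lower()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or needle not in m.group("agent").lower():
            continue
        path = m.group("path").split("?", 1)[0]   # drop any query string
        last = path.rsplit("/", 1)[-1]
        ext = last.rsplit(".", 1)[-1].lower() if "." in last else "(none)"
        formats[ext] += 1
        days.add(m.group("day"))
        visited.add(path)
    total = sum(formats.values())
    return {
        # format preference: share of the crawler's requests per extension
        "format_share": {e: n / total for e, n in formats.items()} if total else {},
        # visit frequency: number of distinct days with at least one request
        "active_days": len(days),
        # pervasiveness: fraction of known site paths the crawler requested
        "coverage": len(visited & site_paths) / len(site_paths) if site_paths else 0.0,
    }

# Made-up sample data, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2003:13:55:36 +0200] "GET /index.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1"',
    '10.0.0.1 - - [10/Oct/2003:13:56:02 +0200] "GET /papers/a.pdf HTTP/1.0" 200 9001 "-" "Googlebot/2.1"',
    '10.0.0.1 - - [11/Oct/2003:02:10:11 +0200] "GET /papers/a.pdf HTTP/1.0" 304 0 "-" "Googlebot/2.1"',
    '10.0.0.2 - - [11/Oct/2003:09:00:00 +0200] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/4.0"',
]
SITE = {"/index.html", "/papers/a.pdf", "/papers/b.pdf", "/about.html"}

result = crawler_metrics(SAMPLE, "googlebot", SITE)
print(result)
```

Matching on the User-Agent string is the simplest choice for a sketch, but it only finds crawlers that identify themselves; a real study would likely also need IP-address or behavioral evidence.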