In this paper, we present a characterization study of search-engine crawlers. Our study uses Web-server access logs from five academic sites in three different countries. Based on these logs, we analyze the activity of crawlers that belong to five search engines: Google, AltaVista, Inktomi, FastSearch and CiteSeer. We compare crawler behavior with the characteristics of general World-Wide Web traffic and with the findings of general characterization studies. We analyze crawler requests to derive insights into the behavior and strategy of crawlers. We propose a set of simple metrics that describe qualitative characteristics of crawler behavior: a crawler's preference for resources of a particular format, the frequency of its visits to a Web site, and the pervasiveness of its visits to a particular site. To the best of our knowledge, this is the first extensive and in-depth characterization of search-engine crawlers. Our results and observations provide useful insights into crawler behavior and serve as a basis for our ongoing work on the automatic detection of Web crawlers.
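As a rough illustration of the three kinds of metrics named above, the following minimal sketch computes stand-ins for them from access-log lines. Everything in it is an assumption made for illustration and not the paper's own definitions: the logs are taken to be in the Combined Log Format, a crawler is identified by a substring of its User-Agent field, format preference is approximated by the share of requests per file extension, visit frequency by requests per active day, and pervasiveness by the fraction of distinct logged paths that the crawler requested. The names `parse`, `crawler_metrics` and `SAMPLE`, and the sample log lines themselves, are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime
from os.path import splitext
from urllib.parse import urlsplit

# Assumed input: Combined Log Format
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Return day, path and user agent for one log line, or None if malformed."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    path = urlsplit(m["path"]).path  # drop any query string
    return {"day": when.date(), "path": path, "agent": m["agent"]}


def crawler_metrics(lines, agent_token):
    """Illustrative stand-ins for format preference, frequency and pervasiveness."""
    records = [r for r in map(parse, lines) if r is not None]
    # Assumption: a crawler is recognized by a User-Agent substring.
    crawler = [r for r in records if agent_token.lower() in r["agent"].lower()]
    total = len(crawler)

    # Format preference: share of crawler requests per file extension.
    formats = Counter(splitext(r["path"])[1].lower() or "(none)" for r in crawler)
    format_share = {ext: n / total for ext, n in formats.items()} if total else {}

    # Visit frequency: crawler requests per day on which it was seen.
    active_days = len({r["day"] for r in crawler})
    requests_per_active_day = total / active_days if active_days else 0.0

    # Pervasiveness: fraction of all distinct logged paths the crawler requested.
    site_paths = {r["path"] for r in records}
    crawler_paths = {r["path"] for r in crawler}
    coverage = len(crawler_paths) / len(site_paths) if site_paths else 0.0

    return {
        "format_share": format_share,
        "requests_per_active_day": requests_per_active_day,
        "coverage": coverage,
    }


# Made-up sample lines, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2002:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1"',
    '10.0.0.1 - - [10/Oct/2002:13:56:01 +0000] "GET /papers/a.pdf HTTP/1.0" 200 9000 "-" "Googlebot/2.1"',
    '10.0.0.1 - - [11/Oct/2002:09:00:00 +0000] "GET /index.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1"',
    '10.0.0.2 - - [11/Oct/2002:09:05:00 +0000] "GET /about.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
    '10.0.0.2 - - [11/Oct/2002:09:06:00 +0000] "GET /img/logo.gif HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]

result = crawler_metrics(SAMPLE, "googlebot")
```

User-Agent matching is the simplest possible identification rule and is easy to spoof, so a real analysis would likely combine it with other evidence such as source addresses or requests for robots.txt.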