An investigation of web crawler behavior: characterization and metrics

  • Authors:
  • Marios D. Dikaiakos; Athena Stassopoulou; Loizos Papageorgiou

  • Affiliations:
  • Department of Computer Science, University of Cyprus, P.O. Box 20537, Kallipoleos 75, Nicosia 1678, Cyprus; Department of Computer Science, Intercollege, P.O. Box 24005, Nicosia, Cyprus; Department of Computer Science, University of Cyprus, P.O. Box 20537, Kallipoleos 75, Nicosia 1678, Cyprus

  • Venue:
  • Computer Communications
  • Year:
  • 2005

Quantified Score

Hi-index 0.24

Abstract

In this paper, we present a characterization study of search-engine crawlers. Our study uses Web-server access logs from five academic sites in three different countries. Based on these logs, we analyze the activity of crawlers that belong to five search engines: Google, AltaVista, Inktomi, FastSearch and CiteSeer. We compare crawler behavior to the characteristics of general World-Wide Web traffic and to the findings of general characterization studies. We analyze crawler requests to derive insights into the behavior and strategy of crawlers. We propose a set of simple metrics that describe qualitative characteristics of crawler behavior: a crawler's preference for resources of a particular format, the frequency of its visits to a Web site, and the pervasiveness of its visits to a particular site. To the best of our knowledge, this is the first extensive and in-depth characterization of search-engine crawlers. Our results and observations provide useful insights into crawler behavior and serve as a basis for our ongoing work on the automatic detection of Web crawlers.
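
The abstract names three kinds of metrics (format preference, visit frequency, pervasiveness) without defining them. The sketch below shows one way such quantities might be computed from a Web-server access log. It is an illustration under stated assumptions, not the paper's definitions: the log is assumed to be in Apache Combined Log Format, a crawler is assumed to be identifiable by a substring of its User-Agent, and the function names, the three formulas (share of requests per file extension, number of distinct days with requests, fraction of known site URLs requested) and the sample log lines are all hypothetical.

```python
import re
from collections import Counter
from datetime import datetime
from pathlib import PurePosixPath

# Assumed input: Apache Combined Log Format.
# The referer/user-agent part is optional so plain Common Log Format also parses.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)

def parse(line):
    """Return day, path and user agent for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    # %b expects English month abbreviations (default C locale).
    day = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z").date()
    path = m["path"].split("?", 1)[0]  # drop any query string
    return {"day": day, "path": path, "agent": m["agent"] or ""}

def crawler_metrics(lines, agent_token, site_urls):
    """Hypothetical metrics for requests whose User-Agent contains agent_token."""
    token = agent_token.lower()
    hits = [r for r in map(parse, lines) if r and token in r["agent"].lower()]
    total = len(hits)
    # Format preference (assumed definition): share of requests per file extension.
    formats = Counter(PurePosixPath(r["path"]).suffix.lower() or "(none)" for r in hits)
    format_share = {ext: n / total for ext, n in formats.items()} if total else {}
    # Visit frequency (assumed definition): number of distinct days with a request.
    days_active = len({r["day"] for r in hits})
    # Pervasiveness (assumed definition): fraction of known site URLs requested.
    known = set(site_urls)
    coverage = len({r["path"] for r in hits} & known) / len(known) if known else 0.0
    return {"requests": total, "format_share": format_share,
            "days_active": days_active, "coverage": coverage}

# Made-up log lines for illustration only; they are not data from the study.
sample = [
    '10.0.0.1 - - [01/Mar/2004:10:00:00 +0200] "GET /index.html HTTP/1.0" 200 1024 "-" "Googlebot/2.1"',
    '10.0.0.1 - - [01/Mar/2004:10:00:05 +0200] "GET /paper.pdf HTTP/1.0" 200 20480 "-" "Googlebot/2.1"',
    '10.0.0.1 - - [02/Mar/2004:09:00:00 +0200] "GET /index.html HTTP/1.0" 200 1024 "-" "Googlebot/2.1"',
    '10.0.0.2 - - [02/Mar/2004:09:30:00 +0200] "GET /about.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]
site = {"/index.html", "/paper.pdf", "/about.html", "/contact.html"}
metrics = crawler_metrics(sample, "Googlebot", site)
print(metrics)
```

Matching on the User-Agent string keeps the sketch short, but crawlers can misreport that field, so a real analysis would likely combine it with other signals such as source IP ranges or requests for robots.txt.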