Estimating the web robot population

Authors:
Yang Sun;C. Lee Giles
Affiliations:
AOL Research, Mountain View, CA, USA;The Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

In this research, capture-recapture (CR) models are used to estimate the population of web robots from web server access logs collected at different websites. Each robot is treated as an individual randomly surfing the web, and each website is treated as a trap that records robot visits. We use a maximum likelihood estimator to fit the observed data. Results show that there are 3,860 identifiable robot User-Agent strings and 780,760 IP addresses in use by web robots around the world. We also examine the origin of the named robots by their IP addresses. The results suggest that over 50% of web robot IP addresses are from the United States and China.
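
To illustrate the capture-recapture idea described in the abstract, the following minimal sketch uses the classical two-sample Lincoln-Petersen estimator, with two websites acting as traps. This is an assumption made for illustration: the abstract does not say which CR model the authors fit, and their study uses logs from more than a toy pair of sites. The function name `lincoln_petersen` and the robot identifiers are hypothetical.

```python
def lincoln_petersen(site_a, site_b):
    """Estimate population size from two sets of observed robot identifiers.

    site_a, site_b: sets of identifiers (e.g. User-Agent strings) seen in
    the access logs of two different websites (the "traps").
    """
    n1 = len(site_a)           # robots captured at the first site
    n2 = len(site_b)           # robots captured at the second site
    m = len(site_a & site_b)   # robots seen at both sites (recaptures)
    if m == 0:
        # With no overlap the estimator is undefined (division by zero).
        raise ValueError("no recaptures; estimate is undefined")
    return n1 * n2 / m


# Hypothetical toy data, not taken from the paper.
site_a = {"bot1", "bot2", "bot3", "bot4"}
site_b = {"bot3", "bot4", "bot5", "bot6", "bot7", "bot8"}

estimate = lincoln_petersen(site_a, site_b)
print(estimate)  # 4 * 6 / 2 = 12.0
```

The intuition is that the fraction of the second site's robots already seen at the first site (2 of 6 here) approximates the fraction of the whole population captured by the first site (4 of N), which gives N of about 12. The estimator assumes a closed population and equal capture probability for every robot, conditions that real crawlers may not satisfy.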