In this research, capture-recapture (CR) models are used to estimate the population of web robots from the web server access logs of different websites. Each robot is treated as an individual randomly surfing the web, and each website is treated as a trap that records robot visits. We use a maximum likelihood estimator to fit the observed data. Results show that there are 3,860 identifiable robot User-Agent strings and 780,760 IP addresses being used by web robots around the world. We also examine the origin of the named robots by their IP addresses. The results suggest that over 50% of web robot IP addresses are from the United States and China.
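
To illustrate the trap analogy, the following is a minimal sketch of the simplest two-sample capture-recapture estimate, assuming a closed population and exactly two websites. It is not the model fitted in the paper, which is not specified here beyond the use of maximum likelihood. The Lincoln-Petersen estimator and its Chapman correction are standard textbook estimators swapped in for illustration, and the function names and robot identifiers are hypothetical.

```python
def lincoln_petersen(trap_a, trap_b):
    """Two-sample Lincoln-Petersen estimate of population size.

    trap_a, trap_b: iterables of robot identifiers (for example
    User-Agent strings) seen in the logs of two websites.
    """
    a, b = set(trap_a), set(trap_b)
    recaptured = len(a & b)  # robots recorded by both sites
    if recaptured == 0:
        raise ValueError("no overlap between traps; estimate is undefined")
    return len(a) * len(b) / recaptured


def chapman(trap_a, trap_b):
    """Chapman's bias-corrected variant; defined even with zero overlap."""
    a, b = set(trap_a), set(trap_b)
    recaptured = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (recaptured + 1) - 1


if __name__ == "__main__":
    # Hypothetical robot identifiers, not data from the paper.
    site_a = {"bot1", "bot2", "bot3", "bot4"}
    site_b = {"bot3", "bot4", "bot5", "bot6", "bot7", "bot8"}
    print(lincoln_petersen(site_a, site_b))
    print(chapman(site_a, site_b))
```

With 4 robots seen at the first site, 6 at the second, and 2 seen at both, the Lincoln-Petersen estimate is 4 × 6 / 2 = 12 robots in total. Estimates over more than two websites, or with unequal capture probabilities, need richer CR models than this sketch.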