Mark-recapture models have for many years been used to estimate the unknown sizes of animal and bird populations. In this article we adapt a finite mixture mark-recapture model in order to estimate the number of active telephone lines in the USA. The idea is to use the calling patterns of lines that are observed on the long distance network to estimate the number of lines that do not appear on the network. We present a Bayesian approach and use Markov chain Monte Carlo methods to obtain inference from the posterior distributions of the model parameters. At the state level, our results are in fairly good agreement with recent published reports on line counts. For lines that are easily classified as business or residence, the estimates have low variance. When the classification is unknown, the variability increases considerably. Results are insensitive to changes in the prior distributions. We discuss the significant computational and data mining challenges caused by the scale of the data, approximately 350 million call-detail records per day observed over a number of weeks.
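The core mark-recapture idea can be illustrated with a much simpler estimator than the one used in the article. The sketch below applies the classical two-occasion Chapman estimator (a bias-corrected Lincoln-Petersen estimator), not the finite mixture Bayesian model described above. It assumes every line is equally likely to appear on the network in each period, which the article's mixture model is designed to relax. The function name, the line identifiers, and the two weekly samples are hypothetical.

```python
def chapman_estimate(first_sample, second_sample):
    """Estimate total population size from two capture occasions.

    Illustrative only: assumes a closed population and equal capture
    probability for every line, unlike the article's mixture model.
    """
    first, second = set(first_sample), set(second_sample)
    n1 = len(first)            # lines observed in the first period
    n2 = len(second)           # lines observed in the second period
    m = len(first & second)    # lines observed in both periods ("recaptures")
    # Chapman's bias-corrected form of the Lincoln-Petersen estimator.
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical data: identifiers of lines seen in two consecutive weeks.
week1 = {"line-%d" % i for i in range(0, 60)}
week2 = {"line-%d" % i for i in range(40, 100)}

estimate = chapman_estimate(week1, week2)
# Lines estimated to exist but never observed in either week.
unseen = estimate - len(week1 | week2)
```

Here the overlap between the two weeks plays the role of recaptured animals: the smaller the overlap relative to the sample sizes, the larger the estimated number of lines that never appeared on the network.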