This paper proposes a statistical approach for estimating the evolution of web pages in directories. The proposal is based on the capture-recapture method used in wildlife biology to study animal, bird, or fish populations, adapted with the assumptions and amendments needed to apply it to a search engine directory. In these experiments, web pages are treated as animals and specific types of web pages as particular species, whose abundance, birth, death, and survival rates are estimated. The population is open: new web pages are submitted to the search engine directory while others are removed from its indexes, resembling immigration and emigration processes in nature. The role of the biologist, who recognizes the species under study and records their history, is assigned to a web page classifier trained on the Open Directory (DMOZ project) taxonomy. The classifier is a three-layer Probabilistic Neural Network that identifies and categorizes web pages on the basis of information filtering. A virtual experiment is simulated based on the classifier's performance over real web pages, and the results are promising.
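To make the capture-recapture idea concrete, the following is a minimal sketch of the simplest two-sample estimator (Lincoln-Petersen with Chapman's bias correction) applied to sets of page identifiers. It is not the paper's method: it assumes a closed population between the two samples, whereas the abstract describes an open population with births and deaths. The function name `chapman_estimate` and the synthetic page identifiers are assumptions made only for illustration.

```python
def chapman_estimate(first_sample, second_sample):
    """Estimate population size from two samples of page identifiers.

    Uses Chapman's bias-corrected form of the Lincoln-Petersen estimator:
        N_hat = (n1 + 1) * (n2 + 1) / (m2 + 1) - 1
    where n1 and n2 are the sample sizes and m2 is the number of pages
    seen in both samples ("recaptures").
    Assumes a closed population, unlike the open model in the abstract.
    """
    first = set(first_sample)
    second = set(second_sample)
    n1 = len(first)
    n2 = len(second)
    m2 = len(first & second)  # pages "marked" in sample 1 and seen again
    return (n1 + 1) * (n2 + 1) / (m2 + 1) - 1


# Synthetic illustration: 200 pages in the first sample, 150 in the
# second, 30 of which appear in both.
first = {f"page{i}" for i in range(200)}
second = {f"page{i}" for i in range(170, 320)}
estimate = chapman_estimate(first, second)
print(round(estimate, 2))
```

In the setting the abstract describes, the "marking" step would be performed by the classifier recognizing pages of a given category, and an open-population model would be needed to estimate birth, death, and survival rates in addition to abundance.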