We argue that for some applications the total number of web pages stored by an Internet search engine for a specific language is a relevant quantity. We show that even elementary steps toward statistics characterizing Google's database are bewildering: simple set-theoretic operations give evidently inconsistent results. Without claiming ultimate precision, we propose a method for estimating the total number of pages in a given language at a given moment. It takes the Google page counts for the words most frequent in a representative text corpus, reorders these words, and computes maximum likelihood estimates of their contributions. Applied to Spanish, the method gives results whose theoretically calculated precision is much higher than actually needed, even though it rests on an error-prone mechanism for obtaining the raw statistical data.
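As a rough illustration of the estimation step, the following minimal Python sketch combines per-word page counts into a single estimate. It assumes (a modeling choice of ours, not spelled out in the abstract) that the page count c_i reported for word i is Poisson-distributed with mean N * f_i, where N is the unknown total page count and f_i is the fraction of corpus documents containing the word; under that assumption the joint maximum likelihood estimate reduces to N_hat = sum(c_i) / sum(f_i). The function name and the toy numbers are hypothetical.

# Sketch: estimate the total number of indexed pages N for a language
# from search-engine hit counts of its most frequent words.
# Assumed model (not from the abstract): c_i ~ Poisson(N * f_i).
# Setting d/dN of the log-likelihood to zero gives
#   -sum(f_i) + sum(c_i)/N = 0  =>  N_hat = sum(c_i) / sum(f_i).

def estimate_total_pages(page_counts, doc_fractions):
    """page_counts: engine-reported page counts per word (c_i).
    doc_fractions: fraction of corpus documents containing each word (f_i).
    Returns the maximum likelihood estimate of N under the Poisson model."""
    if len(page_counts) != len(doc_fractions):
        raise ValueError("one corpus fraction per word is required")
    return sum(page_counts) / sum(doc_fractions)

# Toy usage with made-up numbers (illustrative only):
counts = [2_100_000, 1_950_000, 1_400_000]  # e.g. hits for "de", "la", "que"
fracs = [0.95, 0.90, 0.70]                  # corpus document fractions
print(f"Estimated pages: {estimate_total_pages(counts, fracs):,.0f}")

Pooling all words into one estimator this way also suggests why the precision can be high: the combined count behaves like a single Poisson observation with a very large mean, so its relative standard error is small even if each individual engine count is noisy.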