Can we correctly estimate the total number of pages in Google for a specific language?

  • Authors:
  • Igor A. Bolshakov;Sofia N. Galicia-Haro

  • Affiliations:
  • Center for Computing Research, National Polytechnic Institute, Mexico City, Mexico;Center for Computing Research, National Polytechnic Institute, Mexico City, Mexico

  • Venue:
  • CICLing'03: Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2003

Abstract

We argue that, for some applications, it matters how many web pages in a specific language an Internet search engine actually stores. We show that even elementary steps in gathering statistics about the Google database yield bewildering results: simple set-theoretic operations give evidently inconsistent counts. Without claiming ultimate precision, we propose a method for estimating the total number of pages in a given language at a given moment. The method takes the Google page counts for the words most frequent in a representative text corpus, reorders these words, and computes maximum likelihood estimates of their contributions. Applied to Spanish, the method gives results whose theoretically calculated precision is much higher than is really needed, given that it rests on such an error-prone source of raw statistical data.
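
The kind of set-theoretic inconsistency mentioned in the abstract can be illustrated with the inclusion-exclusion identity |A ∪ B| = |A| + |B| − |A ∩ B|. The following is a minimal sketch of such a check, not the authors' procedure: the function names, the 5% tolerance, and all counts are hypothetical and chosen only for illustration. It assumes four page counts have already been obtained by some means for the queries "a", "b", "a AND b", and "a OR b".

```python
def union_discrepancy(count_a, count_b, count_and, count_or):
    """Relative gap between a reported OR count and the value implied by
    inclusion-exclusion, |A or B| = |A| + |B| - |A and B|."""
    expected_or = count_a + count_b - count_and
    if expected_or <= 0:
        raise ValueError("counts imply an empty or negative union")
    return (count_or - expected_or) / expected_or


def is_consistent(count_a, count_b, count_and, count_or, tolerance=0.05):
    """Check reported counts against basic set constraints.

    The tolerance is an illustrative assumption, not a value from the paper.
    """
    # An intersection cannot be larger than either of its operands.
    if count_and > min(count_a, count_b):
        return False
    # A union cannot be smaller than either of its operands.
    if count_or < max(count_a, count_b):
        return False
    gap = union_discrepancy(count_a, count_b, count_and, count_or)
    return abs(gap) <= tolerance


if __name__ == "__main__":
    # Hypothetical counts, invented for illustration only.
    print(is_consistent(1000, 800, 300, 1500))  # union matches the identity
    print(is_consistent(1000, 800, 300, 900))   # union smaller than one operand
```

A check of this sort only flags inconsistent counts; it does not say which of the reported numbers is wrong, which is one reason an estimation method has to tolerate noisy raw data.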