We argue that for some applications the total number of web pages stored by an Internet search engine for a specific language is a relevant quantity. We show that even elementary steps toward statistics characterizing Google's database are bewildering: simple set-theoretic operations give evidently inconsistent results. Without claiming ultimate precision, we propose a method for estimating the total number of pages in a given language at a given moment. It takes the Google page counts for the words most frequent in a representative text corpus, reorders these words, and computes maximum likelihood estimates of their contributions. Applied to Spanish, the method gives results whose theoretically calculated precision is much higher than actually needed, even though it rests on an error-prone mechanism for obtaining the raw statistical data.
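As a rough illustration of the estimation step, the following minimal Python sketch combines per-word page counts into a single estimate. It assumes (a modeling choice of ours, not spelled out in the abstract) that the page count c_i reported for word i is Poisson-distributed with mean N * f_i, where N is the unknown total page count and f_i is the fraction of corpus documents containing the word; under that assumption the joint maximum likelihood estimate reduces to N_hat = sum(c_i) / sum(f_i). The function name and the toy numbers are hypothetical.

# Sketch: estimate the total number of indexed pages N for a language
# from search-engine hit counts of its most frequent words.
# Assumed model (not from the abstract): c_i ~ Poisson(N * f_i).
# Setting d/dN of the log-likelihood to zero gives
#   -sum(f_i) + sum(c_i)/N = 0  =>  N_hat = sum(c_i) / sum(f_i).

def estimate_total_pages(page_counts, doc_fractions):
    """page_counts: engine-reported page counts per word (c_i).
    doc_fractions: fraction of corpus documents containing each word (f_i).
    Returns the maximum likelihood estimate of N under the Poisson model."""
    if len(page_counts) != len(doc_fractions):
        raise ValueError("one corpus fraction per word is required")
    return sum(page_counts) / sum(doc_fractions)

# Toy usage with made-up numbers (illustrative only):
counts = [2_100_000, 1_950_000, 1_400_000]  # e.g. hits for "de", "la", "que"
fracs = [0.95, 0.90, 0.70]                  # corpus document fractions
print(f"Estimated pages: {estimate_total_pages(counts, fracs):,.0f}")

Pooling all words into one estimator this way also suggests why the precision can be high: the combined count behaves like a single Poisson observation with a very large mean, so its relative standard error is small even if each individual engine count is noisy.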