A technique for measuring the relative size and overlap of public Web search engines. WWW7: Proceedings of the Seventh International Conference on World Wide Web.
Measuring index quality using random walks on the Web. WWW '99: Proceedings of the Eighth International Conference on World Wide Web.
Accessibility of information on the Web. intelligence.
Proceedings of the 9th International World Wide Web Conference on Computer Networks: The International Journal of Computer and Telecommunications Networking.
Approximating aggregate queries about Web pages via random walks. VLDB '00: Proceedings of the 26th International Conference on Very Large Data Bases.
The indexable Web is more than 11.5 billion pages. WWW '05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web.
Estimating corpus size via queries. CIKM '06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management.
Sampling search-engine results. World Wide Web.
Efficient search engine measurements. Proceedings of the 16th International Conference on World Wide Web.
A random walk approach to sampling hidden databases. Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data.
Random sampling from a search engine's index. Journal of the ACM.
Mining search engine query logs via suggestion sampling. Proceedings of the VLDB Endowment.
Estimating the ImpressionRank of Web pages. Proceedings of the 18th International Conference on World Wide Web.
Monte Carlo Strategies in Scientific Computing.
Size estimation of non-cooperative data collections. Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services.
Estimating clustering coefficients and size of social networks via random walk. Proceedings of the 22nd International Conference on World Wide Web.
On estimating the average degree. Proceedings of the 23rd International Conference on World Wide Web.
We address the problem of externally measuring aggregate functions over documents indexed by search engines, such as corpus size, index freshness, and density of duplicates in the corpus. State-of-the-art estimators for such quantities [Bar-Yossef and Gurevich 2008b; Broder et al. 2006] are biased because they rely on inaccurate approximations of the so-called "document degrees". In addition, the estimators of Bar-Yossef and Gurevich [2008b] are quite costly, due to their reliance on rejection sampling. We present new estimators that overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis demonstrates that the estimators have essentially no bias, even when document degrees are poorly approximated. By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than those of Bar-Yossef and Gurevich [2008b]. Furthermore, building on an idea of Broder et al. [2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators, and we show that Rao-Blackwellizing our estimators yields performance improvements without compromising accuracy.
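To illustrate the two ideas the abstract names, the following is a toy sketch (not the paper's actual estimators; the corpus, query pool, and all parameters are hypothetical) of importance sampling over a simulated index. Documents are reached by picking a uniform pool query and then a uniform document from its result set; knowing a document's degree lets us compute its trial probability, and averaging 1/p(x) estimates the corpus size. A Rao-Blackwellized variant conditions on the sampled query and averages over its entire result set, in the spirit of Broder et al. [2006]:

```python
import random

# Hypothetical simulation: a corpus, a query pool, and results(q) = the
# documents matching query q. In the external-measurement setting only
# sampled queries and their result sets are observable; here we cheat
# and compute exact degrees, which real estimators must approximate.
random.seed(7)
CORPUS_SIZE = 10_000
POOL_SIZE = 500

results = {q: [] for q in range(POOL_SIZE)}   # query -> matching docs
doc_queries = {}                              # doc -> matching queries
for doc in range(CORPUS_SIZE):
    qs = random.sample(range(POOL_SIZE), k=random.randint(1, 5))
    doc_queries[doc] = qs
    for q in qs:
        results[q].append(doc)

pool = [q for q in results if results[q]]     # ignore empty queries

def trial_prob(doc):
    # Probability that `doc` is produced by: uniform pool query, then a
    # uniform document from its result set. This needs the document's
    # degree; approximating it badly is what biases naive estimators.
    return sum(1.0 / len(results[q]) for q in doc_queries[doc]) / len(pool)

def is_estimate(num_samples):
    # Plain importance sampling: E[1 / p(X)] equals the corpus size.
    total = 0.0
    for _ in range(num_samples):
        q = random.choice(pool)
        doc = random.choice(results[q])
        total += 1.0 / trial_prob(doc)
    return total / num_samples

def rb_estimate(num_samples):
    # Rao-Blackwellized variant: condition on the sampled query and
    # average the estimator over its whole result set instead of using
    # one document, reducing variance at no extra query cost.
    total = 0.0
    for _ in range(num_samples):
        q = random.choice(pool)
        total += sum(1.0 / trial_prob(d) for d in results[q]) / len(results[q])
    return total / num_samples

print(round(is_estimate(2000)))   # close to CORPUS_SIZE
print(round(rb_estimate(2000)))   # typically closer still
```

Both estimators are unbiased here because `trial_prob` uses exact degrees; perturbing the degrees inside `trial_prob` reproduces the bias that the paper's approximate importance sampling procedure is designed to remove.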