A technique for measuring the relative size and overlap of public Web search engines. WWW7: Proceedings of the Seventh International Conference on World Wide Web.
Measuring index quality using random walks on the Web. WWW '99: Proceedings of the Eighth International Conference on World Wide Web.
Accessibility of information on the Web. intelligence.
Proceedings of the 9th International World Wide Web Conference on Computer Networks: The International Journal of Computer and Telecommunications Networking.
Approximating aggregate queries about Web pages via random walks. VLDB '00: Proceedings of the 26th International Conference on Very Large Data Bases.
The indexable Web is more than 11.5 billion pages. WWW '05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web.
Estimating corpus size via queries. CIKM '06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management.
Sampling search-engine results. World Wide Web.
Efficient search engine measurements. Proceedings of the 16th International Conference on World Wide Web.
A random walk approach to sampling hidden databases. Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data.
Random sampling from a search engine's index. Journal of the ACM.
Mining search engine query logs via suggestion sampling. Proceedings of the VLDB Endowment.
Estimating the ImpressionRank of Web pages. Proceedings of the 18th International Conference on World Wide Web.
Monte Carlo Strategies in Scientific Computing.
Size estimation of non-cooperative data collections. Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services.
Estimating clustering coefficients and size of social networks via random walk. Proceedings of the 22nd International Conference on World Wide Web.
On estimating the average degree. Proceedings of the 23rd International Conference on World Wide Web.
We address the problem of externally measuring aggregate functions over documents indexed by search engines, such as corpus size, index freshness, and density of duplicates in the corpus. State-of-the-art estimators for such quantities [Bar-Yossef and Gurevich 2008b; Broder et al. 2006] are biased because they rely on inaccurate approximations of the so-called "document degrees". In addition, the estimators of Bar-Yossef and Gurevich [2008b] are quite costly, due to their reliance on rejection sampling. We present new estimators that overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis demonstrates that the estimators have essentially no bias, even when document degrees are poorly approximated. By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than those of Bar-Yossef and Gurevich [2008b]. Furthermore, building on an idea of Broder et al. [2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators, and we show that Rao-Blackwellizing our estimators yields performance improvements without compromising accuracy.
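To illustrate the two ideas the abstract names, the following is a toy sketch (not the paper's actual estimators; the corpus, query pool, and all parameters are hypothetical) of importance sampling over a simulated index. Documents are reached by picking a uniform pool query and then a uniform document from its result set; knowing a document's degree lets us compute its trial probability, and averaging 1/p(x) estimates the corpus size. A Rao-Blackwellized variant conditions on the sampled query and averages over its entire result set, in the spirit of Broder et al. [2006]:

```python
import random

# Hypothetical simulation: a corpus, a query pool, and results(q) = the
# documents matching query q. In the external-measurement setting only
# sampled queries and their result sets are observable; here we cheat
# and compute exact degrees, which real estimators must approximate.
random.seed(7)
CORPUS_SIZE = 10_000
POOL_SIZE = 500

results = {q: [] for q in range(POOL_SIZE)}   # query -> matching docs
doc_queries = {}                              # doc -> matching queries
for doc in range(CORPUS_SIZE):
    qs = random.sample(range(POOL_SIZE), k=random.randint(1, 5))
    doc_queries[doc] = qs
    for q in qs:
        results[q].append(doc)

pool = [q for q in results if results[q]]     # ignore empty queries

def trial_prob(doc):
    # Probability that `doc` is produced by: uniform pool query, then a
    # uniform document from its result set. This needs the document's
    # degree; approximating it badly is what biases naive estimators.
    return sum(1.0 / len(results[q]) for q in doc_queries[doc]) / len(pool)

def is_estimate(num_samples):
    # Plain importance sampling: E[1 / p(X)] equals the corpus size.
    total = 0.0
    for _ in range(num_samples):
        q = random.choice(pool)
        doc = random.choice(results[q])
        total += 1.0 / trial_prob(doc)
    return total / num_samples

def rb_estimate(num_samples):
    # Rao-Blackwellized variant: condition on the sampled query and
    # average the estimator over its whole result set instead of using
    # one document, reducing variance at no extra query cost.
    total = 0.0
    for _ in range(num_samples):
        q = random.choice(pool)
        total += sum(1.0 / trial_prob(d) for d in results[q]) / len(results[q])
    return total / num_samples

print(round(is_estimate(2000)))   # close to CORPUS_SIZE
print(round(rb_estimate(2000)))   # typically closer still
```

Both estimators are unbiased here because `trial_prob` uses exact degrees; perturbing the degrees inside `trial_prob` reproduces the bias that the paper's approximate importance sampling procedure is designed to remove.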