A technique for measuring the relative size and overlap of public Web search engines
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Measuring index quality using random walks on the Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Approximating Aggregate Queries about Web Pages via Random Walks
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
The indexable web is more than 11.5 billion pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Random sampling from a search engine's index
Proceedings of the 15th international conference on World Wide Web
Estimating corpus size via queries
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Random sampling from a search engine's index
Journal of the ACM (JACM)
Mining search engine query logs via suggestion sampling
Proceedings of the VLDB Endowment
Privacy preservation of aggregates in hidden databases: why and how?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
A coherent measurement of web-search relevance
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans
Estimating deep web data source size by capture-recapture method
Information Retrieval
Turbo-charging hidden database samplers with overflowing queries and skew reduction
Proceedings of the 13th International Conference on Extending Database Technology
Unbiased estimation of size and other aggregates over hidden web databases
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ranking bias in deep web size estimation using capture recapture method
Data & Knowledge Engineering
Estimating sizes of social networks via biased sampling
Proceedings of the 20th international conference on World wide web
Foundations and Trends in Information Retrieval
Attribute domain discovery for hidden web databases
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient Search Engine Measurements
ACM Transactions on the Web (TWEB)
An overview of Web search evaluation methods
Computers and Electrical Engineering
Aggregate suppression for enterprise search engines
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Optimal algorithms for crawling a hidden database in the web
Proceedings of the VLDB Endowment
Estimating sum by weighted sampling
ICALP'07 Proceedings of the 34th international conference on Automata, Languages and Programming
Size estimation of non-cooperative data collections
Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
Mining a search engine's corpus without a query pool
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Rank discovery from web databases
Proceedings of the VLDB Endowment
We address the problem of measuring global quality metrics of search engines, like corpus size, index freshness, and density of duplicates in the corpus. The recently proposed estimators for such metrics [2, 6] suffer from significant bias and/or poor performance, due to inaccurate approximation of the so-called "document degrees".

We present two new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias, even in situations where document degrees are poorly approximated.

Building on an idea from [6], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in significant performance improvements, while not compromising accuracy.
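To illustrate the kind of importance-sampling correction the abstract refers to, here is a minimal, self-contained sketch. It does not reproduce the paper's estimators: the corpus, degrees, and the `is_dup` attribute are all synthetic assumptions, and a query-based sampler is simulated by drawing documents with probability proportional to their degree. The sketch shows how weighting each sample by 1/degree undoes the degree bias when estimating an aggregate such as the fraction of duplicate documents.

```python
import random

random.seed(7)

# Toy corpus (all values are illustrative assumptions): each document has
# a "degree" (roughly, how many pool queries match it) and a binary
# attribute, e.g. whether it is a duplicate.
N = 2000
degree = [random.randint(1, 20) for _ in range(N)]
is_dup = [random.random() < 0.3 for _ in range(N)]
true_frac = sum(is_dup) / N

def degree_biased_sample(k):
    """Simulate a query-based sampler: document d is returned with
    probability proportional to degree[d], so high-degree documents
    are over-represented."""
    return random.choices(range(N), weights=degree, k=k)

def estimate_dup_fraction(samples):
    """Self-normalized importance-sampling (ratio) estimator: weight
    each sampled document by 1/degree to cancel the degree bias."""
    num = sum(is_dup[d] / degree[d] for d in samples)
    den = sum(1.0 / degree[d] for d in samples)
    return num / den

est = estimate_dup_fraction(degree_biased_sample(20_000))
```

With exact degrees, `est` converges to `true_frac`; the bias the abstract discusses arises precisely when only approximate degrees are available for the weights, which is the situation the paper's estimators are designed to handle.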