A technique for measuring the relative size and overlap of public Web search engines
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Measuring index quality using random walks on the Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Approximating Aggregate Queries about Web Pages via Random Walks
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
The indexable web is more than 11.5 billion pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Random sampling from a search engine's index
Proceedings of the 15th international conference on World Wide Web
Estimating corpus size via queries
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Random sampling from a search engine's index
Journal of the ACM (JACM)
Mining search engine query logs via suggestion sampling
Proceedings of the VLDB Endowment
Privacy preservation of aggregates in hidden databases: why and how?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
A coherent measurement of web-search relevance
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans
Estimating deep web data source size by capture-recapture method
Information Retrieval
Turbo-charging hidden database samplers with overflowing queries and skew reduction
Proceedings of the 13th International Conference on Extending Database Technology
Unbiased estimation of size and other aggregates over hidden web databases
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ranking bias in deep web size estimation using capture recapture method
Data & Knowledge Engineering
Estimating sizes of social networks via biased sampling
Proceedings of the 20th international conference on World wide web
Foundations and Trends in Information Retrieval
Attribute domain discovery for hidden web databases
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient Search Engine Measurements
ACM Transactions on the Web (TWEB)
An overview of Web search evaluation methods
Computers and Electrical Engineering
Aggregate suppression for enterprise search engines
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Optimal algorithms for crawling a hidden database in the web
Proceedings of the VLDB Endowment
Estimating sum by weighted sampling
ICALP'07 Proceedings of the 34th international conference on Automata, Languages and Programming
Size estimation of non-cooperative data collections
Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
Mining a search engine's corpus without a query pool
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Rank discovery from web databases
Proceedings of the VLDB Endowment
We address the problem of measuring global quality metrics of search engines, like corpus size, index freshness, and density of duplicates in the corpus. The recently proposed estimators for such metrics [2, 6] suffer from significant bias and/or poor performance, due to inaccurate approximation of the so-called "document degrees".

We present two new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias, even in situations where document degrees are poorly approximated.

Building on an idea from [6], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in significant performance improvements, while not compromising accuracy.
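To illustrate the kind of importance-sampling correction the abstract refers to, here is a minimal, self-contained sketch. It does not reproduce the paper's estimators: the corpus, degrees, and the `is_dup` attribute are all synthetic assumptions, and a query-based sampler is simulated by drawing documents with probability proportional to their degree. The sketch shows how weighting each sample by 1/degree undoes the degree bias when estimating an aggregate such as the fraction of duplicate documents.

```python
import random

random.seed(7)

# Toy corpus (all values are illustrative assumptions): each document has
# a "degree" (roughly, how many pool queries match it) and a binary
# attribute, e.g. whether it is a duplicate.
N = 2000
degree = [random.randint(1, 20) for _ in range(N)]
is_dup = [random.random() < 0.3 for _ in range(N)]
true_frac = sum(is_dup) / N

def degree_biased_sample(k):
    """Simulate a query-based sampler: document d is returned with
    probability proportional to degree[d], so high-degree documents
    are over-represented."""
    return random.choices(range(N), weights=degree, k=k)

def estimate_dup_fraction(samples):
    """Self-normalized importance-sampling (ratio) estimator: weight
    each sampled document by 1/degree to cancel the degree bias."""
    num = sum(is_dup[d] / degree[d] for d in samples)
    den = sum(1.0 / degree[d] for d in samples)
    return num / den

est = estimate_dup_fraction(degree_biased_sample(20_000))
```

With exact degrees, `est` converges to `true_frac`; the bias the abstract discusses arises precisely when only approximate degrees are available for the weights, which is the situation the paper's estimators are designed to handle.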