Sampling search-engine results

Authors:
Aris Anagnostopoulos; Andrei Z. Broder; David Carmel
Affiliations:
Brown University, Providence, RI; IBM T. J. Watson Research Center, Hawthorne, NY; IBM Haifa Research Lab, Haifa, Israel
Venue:
WWW '05 Proceedings of the 14th international conference on World Wide Web
Year:
2005

Abstract

We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as: determining the set of categories in a given taxonomy spanned by the search results; finding the range of metadata values associated with the result set in order to enable "multi-faceted search"; estimating the size of the result set; and data mining associations to the query terms.

We present and analyze an efficient algorithm for obtaining uniform random samples applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g. Google, Inktomi, AltaVista, AllTheWeb, belong to this class.)

Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic next(p) method that samples term posting lists with probability p, and show how to construct next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods.

Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
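The idea of a next(p) method on a posting-list stream can be illustrated with a small Python sketch. This is not the paper's construction: the class `SampledPostings`, the helpers `sample_and` and `estimate_size`, and the in-memory sorted lists are all hypothetical names and simplifications chosen for illustration. The sketch assumes each posting should be kept independently with probability p, which it does by drawing geometric skip lengths rather than flipping a coin per posting. For AND it relies on a simple observation: sampling one list with probability p and keeping only the documents that also occur in the other list yields each document of the intersection with probability p.

```python
import math
import random
from bisect import bisect_left


class SampledPostings:
    """Stream over a sorted posting list that keeps each doc ID with probability p.

    Hypothetical illustration; an in-memory list stands in for a real index.
    """

    def __init__(self, postings, p, rng):
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")
        self.postings = postings
        self.p = p
        self.rng = rng
        self.pos = -1  # index of the last posting returned

    def _skip(self):
        # Number of postings rejected before the next accepted one.
        # Drawn from a geometric distribution: P(skip = k) = (1 - p)^k * p.
        if self.p == 1.0:
            return 0
        u = 1.0 - self.rng.random()  # uniform in (0, 1]
        return int(math.log(u) / math.log1p(-self.p))

    def next(self):
        """Return the next sampled doc ID, or None when the list is exhausted."""
        self.pos += 1 + self._skip()
        if self.pos >= len(self.postings):
            return None
        return self.postings[self.pos]


def sample_and(list_a, list_b, p, rng):
    """Sample the intersection of two sorted posting lists.

    Samples list_a with probability p, then keeps the sampled documents
    that also appear in list_b (checked by binary search).
    """
    stream = SampledPostings(list_a, p, rng)
    sample = []
    doc = stream.next()
    while doc is not None:
        i = bisect_left(list_b, doc)
        if i < len(list_b) and list_b[i] == doc:
            sample.append(doc)
        doc = stream.next()
    return sample


def estimate_size(sample, p):
    """Estimate the full result-set size from a sample taken with probability p."""
    return len(sample) / p


if __name__ == "__main__":
    rng = random.Random(2005)
    term_a = list(range(0, 100_000, 2))  # documents containing a first term
    term_b = list(range(0, 100_000, 3))  # documents containing a second term
    sample = sample_and(term_a, term_b, 0.01, rng)
    print("sample size:", len(sample))
    print("estimated result-set size:", estimate_size(sample, 0.01))
```

Drawing skip lengths means the stream touches only about a fraction p of the sampled list, which is where the efficiency over scanning the full result set would come from. A real engine would advance the second list with its own skip pointers rather than a binary search over an array.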