We consider the problem of efficiently sampling Web search engine query results. Using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:

- Determining the set of categories in a given taxonomy spanned by the search results;
- Finding the range of metadata values associated with the result set in order to enable "multi-faceted search";
- Estimating the size of the result set;
- Data mining associations to the query terms.

We present and analyze an efficient algorithm for obtaining uniform random samples, applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g. Google, Inktomi, AltaVista, AllTheWeb, belong to this class.)

Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic next(p) method that samples term posting lists with probability p, and how to construct next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods.

Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
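To make the idea of a sampling next(p) method concrete, the following is a minimal Python sketch, not the algorithm from the paper. It assumes posting lists are in-memory sorted lists of document ids. The names SampledPostings, contains, sample_and, and estimate_size are hypothetical and introduced only for illustration. The term-level stream keeps each posting independently with probability p by drawing geometric skip lengths. The AND sketch samples the shortest list and checks membership in the others, which is one simple way to keep each document of the intersection with probability p. The paper's own construction for AND, OR, and WAND may differ.

```python
import bisect
import math
import random


class SampledPostings:
    """Stream over a sorted posting list that keeps each doc id with probability p."""

    def __init__(self, postings, p, rng):
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")
        self.postings = postings  # sorted list of doc ids (assumed in memory)
        self.p = p
        self.rng = rng
        self.pos = -1  # index of the last posting returned

    def _skip(self):
        # Number of postings passed over before the next kept one:
        # geometric on {0, 1, 2, ...} with success probability p.
        if self.p == 1.0:
            return 0
        u = 1.0 - self.rng.random()  # uniform in (0, 1]
        return int(math.log(u) / math.log(1.0 - self.p))

    def next(self):
        """Return the next sampled doc id, or None when the list is exhausted."""
        self.pos += 1 + self._skip()
        if self.pos >= len(self.postings):
            self.pos = len(self.postings)
            return None
        return self.postings[self.pos]


def contains(postings, doc_id):
    """Binary-search membership test on a sorted posting list."""
    i = bisect.bisect_left(postings, doc_id)
    return i < len(postings) and postings[i] == doc_id


def sample_and(lists, p, rng):
    """Sample the intersection of the posting lists, keeping each doc with probability p.

    Assumption: sampling the shortest list and filtering against the rest is
    used here for simplicity; it is not claimed to be the paper's method.
    """
    ordered = sorted(lists, key=len)
    driver = SampledPostings(ordered[0], p, rng)
    sample = []
    doc = driver.next()
    while doc is not None:
        if all(contains(other, doc) for other in ordered[1:]):
            sample.append(doc)
        doc = driver.next()
    return sample


def estimate_size(sample, p):
    """Unbiased estimate of the full result-set size from a probability-p sample."""
    return len(sample) / p
```

For example, calling sample_and([a, b], 0.2, random.Random(7)) on two sorted lists a and b returns roughly one fifth of their common documents, and estimate_size of that sample with p = 0.2 approximates the size of the full intersection. The geometric skip means the work on the driving list is proportional to the number of sampled postings rather than to the list length.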