Mining search engine query logs via suggestion sampling

Authors:
Ziv Bar-Yossef;Maxim Gurevich
Affiliations:
Haifa, Israel and Google Haifa Engineering Center, Israel;Haifa, Israel
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

Many search engines and other web applications suggest auto-completions as the user types in a query. The suggestions are generated from hidden underlying databases, such as query logs, directories, and lexicons. These databases consist of interesting and useful information, but they are typically not directly accessible. In this paper we describe two algorithms for sampling suggestions using only the public suggestion interface. One of the algorithms samples suggestions uniformly at random and the other samples suggestions proportionally to their popularity. These algorithms can be used to mine the hidden suggestion databases. Example applications include comparison of popularity of given keywords within a search engine's query log, estimation of the volume of commercially-oriented queries in a query log, and evaluation of the extent to which a search engine exposes its users to negative content. Our algorithms employ Monte Carlo methods in order to obtain unbiased samples from the suggestion database. Empirical analysis using a publicly available query log demonstrates that our algorithms are efficient and accurate. Results of experiments on two major suggestion services are also provided.
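The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea, not the authors' method: walk down a prefix tree using nothing but a top-k suggestion interface, record the probability with which each suggestion was reached, and use the inverse of that probability as a Monte Carlo importance weight. Everything here is an assumption made for illustration: the toy database, the two-letter alphabet, the limit K, the rule that an exact match is ranked first, and the names suggest, walk, and estimate.

```python
import itertools
import random

ALPHABET = "ab"  # toy alphabet (assumption)
K = 3            # the toy service returns at most K suggestions (assumption)

# Hypothetical hidden database: every string over ALPHABET of length 1..4,
# each with an arbitrary popularity score.
_rng = random.Random(0)
DATABASE = {
    "".join(chars): _rng.randint(1, 1000)
    for n in range(1, 5)
    for chars in itertools.product(ALPHABET, repeat=n)
}


def suggest(prefix):
    """Toy stand-in for a public suggestion interface: top-K completions.

    Assumption: an exact match for the prefix is always ranked first.
    """
    matches = [s for s in DATABASE if s.startswith(prefix)]
    matches.sort(key=lambda s: (s != prefix, -DATABASE[s], s))
    return matches[:K]


def walk(rng):
    """Perform one random walk down the prefix tree.

    Returns (suggestion, probability of reaching it), or (None, 0.0)
    if the walk hits a prefix with no completions.
    """
    prefix, prob = "", 1.0
    while True:
        results = suggest(prefix)
        if not results:
            return None, 0.0
        if len(results) < K:
            # Fewer than K results: every completion is visible, pick one.
            return rng.choice(results), prob / len(results)
        # K results: completions may be hidden, so extend the prefix
        # (or stop here if the prefix itself is a suggestion).
        options = list(ALPHABET)
        if prefix in results:
            options.append(None)  # None means "stop at this prefix"
        choice = rng.choice(options)
        prob /= len(options)
        if choice is None:
            return prefix, prob
        prefix += choice


def estimate(num_walks, predicate, seed=1):
    """Estimate the database size and the fraction satisfying predicate."""
    rng = random.Random(seed)
    total_weight = 0.0
    hit_weight = 0.0
    for _ in range(num_walks):
        suggestion, prob = walk(rng)
        if suggestion is None:
            continue  # failed walks contribute weight 0
        weight = 1.0 / prob  # importance weight toward the uniform target
        total_weight += weight
        if predicate(suggestion):
            hit_weight += weight
    return total_weight / num_walks, hit_weight / total_weight


size, fraction = estimate(20000, lambda s: s.endswith("b"))
print(f"estimated size: {size:.1f} (true size: {len(DATABASE)})")
print(f"estimated fraction ending in 'b': {fraction:.3f}")
```

In this toy setting each suggestion is reachable along exactly one path, so the mean weight is an unbiased estimate of the number of suggestions, and the weighted ratio approximates a uniform-sample average. Reweighting by a popularity score, where the interface exposes one, would be one way to target popularity-proportional estimates instead; that step is not shown here.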