Entropy of search logs: how hard is search? with personalization? with backoff?

Authors:
Qiaozhu Mei;Kenneth Church
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL;Microsoft Research, Redmond, WA
Venue:
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Year:
2008

Abstract

How many pages are there on the Web? 5B? 20B? More? Less? Big bets on clusters in the clouds could be wiped out if a small cache of a few million URLs could capture much of the value. Language modeling techniques are applied to MSN's search logs to estimate entropy. The perplexity is surprisingly small: millions, not billions. Entropy is a powerful tool for sizing challenges and opportunities. How hard is search? How hard are query suggestion mechanisms like auto-complete? How much does personalization help? All these difficult questions can be answered by estimating entropy from search logs. What is the potential opportunity for personalization? In this paper, we propose a new way to personalize search, personalization with backoff. If we have relevant data for a particular user, we should use it. But if we don't, we back off to larger and larger classes of similar users. As a proof of concept, we use the first few bytes of the IP address to define classes. The coefficients of each backoff class are estimated with an EM algorithm. Ideally, classes would be defined by market segments, demographics and surrogate variables such as time and geography.
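
The two ideas in the abstract, entropy estimated from a log and personalization with backoff over IP-prefix classes, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a log of hypothetical (IP, URL) pairs, a maximum-likelihood estimate per class, a linear interpolation of the classes, and EM run on held-out events to fit the interpolation coefficients. All function names are made up for illustration, and events that no class has seen are simply skipped, which a real system would handle with smoothing.

```python
import math
from collections import Counter, defaultdict

LEVELS = (4, 3, 2, 1, 0)  # bytes of the IP kept per class; 0 = everyone (assumed scheme)

def entropy(counts):
    """Shannon entropy in bits of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def prefix(ip, n_bytes):
    """First n_bytes of a dotted IPv4 address; n_bytes=0 gives the global class."""
    return ".".join(ip.split(".")[:n_bytes])

def train_counts(records, levels=LEVELS):
    """One table per backoff level: class key -> Counter of URLs."""
    tables = []
    for n in levels:
        table = defaultdict(Counter)
        for ip, url in records:
            table[prefix(ip, n)][url] += 1
        tables.append(table)
    return tables

def level_probs(tables, ip, url, levels=LEVELS):
    """Maximum-likelihood P_k(url | class_k(ip)) for every backoff level."""
    probs = []
    for n, table in zip(levels, tables):
        counts = table.get(prefix(ip, n))
        total = sum(counts.values()) if counts else 0
        probs.append(counts[url] / total if total else 0.0)
    return probs

def em_weights(tables, heldout, levels=LEVELS, iters=50):
    """Fit interpolation coefficients (one per level) by EM on held-out events."""
    k = len(levels)
    lam = [1.0 / k] * k
    comps = [level_probs(tables, ip, url, levels) for ip, url in heldout]
    for _ in range(iters):
        resp, used = [0.0] * k, 0
        for p in comps:
            mix = sum(l * q for l, q in zip(lam, p))
            if mix == 0.0:
                continue  # unseen at every level; skipped in this sketch
            used += 1
            for i in range(k):
                resp[i] += lam[i] * p[i] / mix  # E-step: responsibility of level i
        if used == 0:
            break
        lam = [r / used for r in resp]  # M-step: normalized responsibilities
    return lam

def cross_entropy(tables, lam, events, levels=LEVELS):
    """Average bits per event under the interpolated model (zero-prob events skipped)."""
    total, n = 0.0, 0
    for ip, url in events:
        mix = sum(l * q for l, q in zip(lam, level_probs(tables, ip, url, levels)))
        if mix > 0.0:
            total -= math.log2(mix)
            n += 1
    return total / n if n else float("nan")
```

Perplexity is `2 ** entropy(counts)`, so a uniform distribution over four URLs has 2 bits of entropy and a perplexity of 4. Comparing `cross_entropy` with all weight on the global class against the EM-fitted weights gives a rough sense of how many bits the narrower classes save on a given log.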