Exploring web scale language models for search query processing

Authors:
Jian Huang;Jianfeng Gao;Jiangbo Miao;Xiaolong Li;Kuansan Wang;Fritz Behr;C. Lee Giles
Affiliations:
Pennsylvania State University, University Park, PA, USA;Microsoft Research, Redmond, WA, USA;Facebook, Palo Alto, CA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Information Sciences and Technology, Pennsylvania State University, PA, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Citing 17
Cited 22

Searching the Web: the public and their queries

Journal of the American Society for Information Science and Technology
Spoken Language Processing: A Guide to Theory, Algorithm, and System Development

Spoken Language Processing: A Guide to Theory, Algorithm, and System Development
Corpus statistics meet the noun compound: some empirical results

ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
Combining Trigram-based and feature-based methods for context-sensitive spelling correction

ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
A study of smoothing methods for language models applied to information retrieval

ACM Transactions on Information Systems (TOIS)
Scaling to very very large corpora for natural language disambiguation

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Generating query substitutions

Proceedings of the 15th international conference on World Wide Web
Improving web search ranking by incorporating user behavior information

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Unsupervised query segmentation using generative language models and wikipedia

Proceedings of the 17th international conference on World Wide Web
The Unreasonable Effectiveness of Data

IEEE Intelligent Systems
Smoothing clickthrough data for web search ranking

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Reducing long queries using query quality predictors

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Efficacy of a constantly adaptive language modeling technique for web-scale applications

ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
The linguistic structure of English web-search queries

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
A machine learning approach for improved BM25 retrieval

Proceedings of the 18th ACM conference on Information and knowledge management
Web-scale N-gram models for lexical disambiguation

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Search engine statistics beyond the n-gram: application to noun compound bracketing

CONLL '05 Proceedings of the Ninth Conference on Computational Natural Language Learning

Multi-style language model for web scale information retrieval

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
An overview of Microsoft web N-gram corpus and applications

HLT-DEMO '10 Proceedings of the NAACL HLT 2010 Demonstration Session
Clickthrough-based translation models for web search: from word models to phrase models

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Query segmentation revisited

Proceedings of the 20th international conference on World wide web
Web scale NLP: a case study on url word breaking

Proceedings of the 20th international conference on World wide web
Key concepts identification and weighting in search engine queries

APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Clickthrough-based latent semantic models for web search

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Synthesizing high utility suggestions for rare web search queries

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Tulsa: web search for writing assistance

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
TWC LOGD: A portal for linked open government data ecosystems

Web Semantics: Science, Services and Agents on the World Wide Web
Focusing on novelty: a crawling strategy to build diverse language models

Proceedings of the 20th ACM international conference on Information and knowledge management
Automatically mining question reformulation patterns from search log data

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Learning lexicon models from search logs for query expansion

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Towards optimum query segmentation: in doubt without

Proceedings of the 21st ACM international conference on Information and knowledge management
Robust query rewriting using anchor data

Proceedings of the sixth ACM international conference on Web search and data mining
Modeling reformulation using query distributions

ACM Transactions on Information Systems (TOIS)
Ontology-based personalised retrieval in support of reminiscence

Knowledge-Based Systems
Query expansion using path-constrained random walks

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Modeling click-through based word-pairs for web search

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Beyond clicks: query reformulation as a predictor of search satisfaction

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
On segmentation of eCommerce queries

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Towards Concept-Based Translation Models Using Search Logs for Query Expansion

Proceedings of the 21st ACM international conference on Information and knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques explicitly accounting for this language style discrepancy have shown promising results for information retrieval, yet a large scale analysis on the extent of the language differences has been lacking. In this paper, we present an extensive study on this issue by examining the language model properties of search queries and the three text streams associated with each web document: the body, the title, and the anchor text. Our information theoretical analysis shows that queries seem to be composed in a way most similar to how authors summarize documents in anchor texts or titles, offering a quantitative explanation to the observations in past work. We apply these web scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing and long query segmentation. By controlling the size and the order of different language models, we find that the perplexity metric to be a good accuracy indicator for these query processing tasks. We show that using smoothed language models yields significant accuracy gains for query bracketing for instance, compared to using web counts as in the literature. We also demonstrate that applying web-scale language models can have marked accuracy advantage over smaller ones.