Multi-style language model for web scale information retrieval

Authors:
Kuansan Wang;Xiaolong Li;Jianfeng Gao
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Year:
2010

Citing 24
Cited 10

An estimate of an upper bound for the entropy of English

Computational Linguistics
A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
A hidden Markov model information retrieval system

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval as statistical translation

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
On Relevance, Probabilistic Indexing and Information Retrieval

Journal of the ACM (JACM)
Document language models, query models, and risk minimization for information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Relevance based language models

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A study of smoothing methods for language models applied to Ad Hoc information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
The Importance of Prior Probabilities for Entry Page Search

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Title language model for information retrieval

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Two-stage language models for information retrieval

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Combining document representations for known-item search

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
A formal study of information retrieval heuristics

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Dependence language model for information retrieval

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Simple BM25 extension to multiple weighted fields

Proceedings of the thirteenth ACM international conference on Information and knowledge management
Query expansion using random walk models

Proceedings of the 14th ACM international conference on Information and knowledge management
Spam double-funnel: connecting web spammers with advertisers

Proceedings of the 16th international conference on World Wide Web
Web resources for language modeling in conversational speech recognition

ACM Transactions on Speech and Language Processing (TSLP)
Introduction to Information Retrieval

Introduction to Information Retrieval
Statistical Language Models for Information Retrieval A Critical Review

Foundations and Trends in Information Retrieval
Search Engines: Information Retrieval in Practice

Search Engines: Information Retrieval in Practice
Efficacy of a constantly adaptive language modeling technique for web-scale applications

ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
A machine learning approach for improved BM25 retrieval

Proceedings of the 18th ACM conference on Information and knowledge management
Exploring web scale language models for search query processing

Proceedings of the 19th international conference on World wide web

An overview of Microsoft web N-gram corpus and applications

HLT-DEMO '10 Proceedings of the NAACL HLT 2010 Demonstration Session
Online spelling correction for query completion

Proceedings of the 20th international conference on World wide web
Web scale NLP: a case study on url word breaking

Proceedings of the 20th international conference on World wide web
Clickthrough-based latent semantic models for web search

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Extracting search-focused key n-grams for relevance ranking in web search

Proceedings of the fifth ACM international conference on Web search and data mining
Segmenting web-domains and hashtags using length specific models

Proceedings of the 21st ACM international conference on Information and knowledge management
The downside of markup: examining the harmful effects of CSS and javascript on indexing today's web

Proceedings of the 21st ACM international conference on Information and knowledge management
Salton award lecture information retrieval as engineering science

ACM SIGIR Forum
Representations for multi-document event clustering

Data Mining and Knowledge Discovery
Modeling click-through based word-pairs for web search

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Web documents are typically associated with many text streams, including the body, the title and the URL that are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large scale analysis on their cross entropy, we show that these text streams appear to be composed in different language styles, and hence warrant respective language models to properly describe their properties. We propose a language modeling approach to Web document retrieval in which each document is characterized by a mixture model with components corresponding to the various text streams associated with the document. Immediate issues for such a mixture model arise as all the text streams are not always present for the documents, and they do not share the same lexicon, making it challenging to properly combine the statistics from the mixture components. To address these issues, we introduce an 'open-vocabulary' smoothing technique so that all the component language models have the same cardinality and their scores can simply be linearly combined. To ensure that the approach can cope with Web scale applications, the model training algorithm is designed to require no labeled data and can be fully automated with few heuristics and no empirical parameter tunings. The evaluation on Web document ranking tasks shows that the component language models indeed have varying degrees of capabilities as predicted by the cross-entropy analysis, and the combined mixture model outperforms the state-of-the-art BM25F based system.