An estimate of an upper bound for the entropy of English
Computational Linguistics
A language modeling approach to information retrieval
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
A hidden Markov model information retrieval system
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval as statistical translation
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
On Relevance, Probabilistic Indexing and Information Retrieval
Journal of the ACM (JACM)
Document language models, query models, and risk minimization for information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Relevance based language models
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A study of smoothing methods for language models applied to Ad Hoc information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
The Importance of Prior Probabilities for Entry Page Search
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Title language model for information retrieval
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Two-stage language models for information retrieval
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Combining document representations for known-item search
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
A formal study of information retrieval heuristics
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Dependence language model for information retrieval
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Simple BM25 extension to multiple weighted fields
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Query expansion using random walk models
Proceedings of the 14th ACM international conference on Information and knowledge management
Spam double-funnel: connecting web spammers with advertisers
Proceedings of the 16th international conference on World Wide Web
Web resources for language modeling in conversational speech recognition
ACM Transactions on Speech and Language Processing (TSLP)
Introduction to Information Retrieval
Introduction to Information Retrieval
Statistical Language Models for Information Retrieval A Critical Review
Foundations and Trends in Information Retrieval
Search Engines: Information Retrieval in Practice
Search Engines: Information Retrieval in Practice
Efficacy of a constantly adaptive language modeling technique for web-scale applications
ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
A machine learning approach for improved BM25 retrieval
Proceedings of the 18th ACM conference on Information and knowledge management
Exploring web scale language models for search query processing
Proceedings of the 19th international conference on World wide web
An overview of Microsoft web N-gram corpus and applications
HLT-DEMO '10 Proceedings of the NAACL HLT 2010 Demonstration Session
Online spelling correction for query completion
Proceedings of the 20th international conference on World wide web
Web scale NLP: a case study on url word breaking
Proceedings of the 20th international conference on World wide web
Clickthrough-based latent semantic models for web search
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Extracting search-focused key n-grams for relevance ranking in web search
Proceedings of the fifth ACM international conference on Web search and data mining
Segmenting web-domains and hashtags using length specific models
Proceedings of the 21st ACM international conference on Information and knowledge management
The downside of markup: examining the harmful effects of CSS and javascript on indexing today's web
Proceedings of the 21st ACM international conference on Information and knowledge management
Representations for multi-document event clustering
Data Mining and Knowledge Discovery
Modeling click-through based word-pairs for web search
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Hi-index | 0.00 |
Web documents are typically associated with many text streams, including the body, the title and the URL that are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large scale analysis on their cross entropy, we show that these text streams appear to be composed in different language styles, and hence warrant respective language models to properly describe their properties. We propose a language modeling approach to Web document retrieval in which each document is characterized by a mixture model with components corresponding to the various text streams associated with the document. Immediate issues for such a mixture model arise as all the text streams are not always present for the documents, and they do not share the same lexicon, making it challenging to properly combine the statistics from the mixture components. To address these issues, we introduce an 'open-vocabulary' smoothing technique so that all the component language models have the same cardinality and their scores can simply be linearly combined. To ensure that the approach can cope with Web scale applications, the model training algorithm is designed to require no labeled data and can be fully automated with few heuristics and no empirical parameter tunings. The evaluation on Web document ranking tasks shows that the component language models indeed have varying degrees of capabilities as predicted by the cross-entropy analysis, and the combined mixture model outperforms the state-of-the-art BM25F based system.