In this paper, we describe CALM, a method for building statistical language models for the Web. CALM addresses several challenges unique to Web content. First, CALM does not require the whole corpus to be available before the language model is built; instead, it is designed to progressively adapt itself as Web chunks are made available by the crawler. Second, given the dynamic and dramatic changes in Web content, CALM quickly enriches its lexicon and N-grams as new vocabulary and phrases are discovered. To reduce the heuristics and human intervention typically needed for model adaptation, we derive an information-theoretic formula that lets CALM adapt optimally in the maximum a posteriori (MAP) sense. Testing against a collection of Web chunks in which new vocabulary and phrases are dominant, we show that CALM achieves a comparable and satisfactory model as measured by perplexity. We also show that CALM is robust against overtraining and insensitive to the initial condition, suggesting that the impact of any assumptions made in obtaining the initial model gradually diminishes as CALM runs its full course and adapts to more data.
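To make the chunk-wise adaptation idea concrete, the following is a minimal Python sketch, not the paper's actual algorithm: it adapts a unigram model by linear interpolation as each new chunk arrives, and scores held-out text by perplexity. The constant weight `lam`, the `floor` probability, and all function names are illustrative assumptions; in CALM the adaptation weight is derived from the information-theoretic MAP formula, whose exact form the abstract does not give.

```python
# A rough sketch of progressive language model adaptation, assuming
# unigram statistics for brevity. `lam` is a hypothetical stand-in for
# CALM's MAP-derived adaptation weight.
from collections import Counter
import math

def chunk_model(tokens):
    """Maximum-likelihood unigram model estimated from one Web chunk."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def adapt(background, chunk, lam=0.2, floor=1e-9):
    """Interpolate the running model with the new chunk's model.

    Words unseen so far enter the lexicon with probability
    lam * p_chunk(w), so the vocabulary grows as the crawl progresses.
    """
    vocab = set(background) | set(chunk)
    return {w: (1 - lam) * background.get(w, floor) + lam * chunk.get(w, 0.0)
            for w in vocab}

def perplexity(model, tokens, floor=1e-9):
    """Perplexity of held-out tokens under the current model."""
    log_prob = sum(math.log(model.get(w, floor)) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Usage: start from a seed model, then fold in chunks as the crawler
# delivers them; perplexity on held-out text tracks adaptation quality.
model = chunk_model("the quick brown fox".split())
for chunk in ["the lazy dog sleeps".split(),
              "a fox jumps over the dog".split()]:
    model = adapt(model, chunk_model(chunk))
print(perplexity(model, "the fox sleeps".split()))
```

With a fixed `lam`, the influence of the seed model decays geometrically as chunks are folded in, which loosely mirrors the abstract's observation that the initial condition's impact diminishes as CALM adapts to more data.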