An overview of Microsoft web N-gram corpus and applications

Authors:
Kuansan Wang;Christopher Thrasher;Evelyne Viegas;Xiaolong Li;Bo-june (Paul) Hsu
Affiliations:
Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA
Venue:
HLT-DEMO '10 Proceedings of the NAACL HLT 2010 Demonstration Session
Year:
2010

Citing 4
Cited 23

Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing

HLT '01 Proceedings of the first international conference on Human language technology research
Efficacy of a constantly adaptive language modeling technique for web-scale applications

ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
Exploring web scale language models for search query processing

Proceedings of the 19th international conference on World wide web
Multi-style language model for web scale information retrieval

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

Enriching textbooks through data mining

Proceedings of the First ACM Symposium on Computing for Development
Web scale NLP: a case study on url word breaking

Proceedings of the 20th international conference on World wide web
Sampling representative phrase sets for text entry experiments: a procedure and public resource

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Unsupervised query segmentation using clickthrough for information retrieval

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Tulsa: web search for writing assistance

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Review of MSR-Bing web scale speller challenge

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
High-order sequence modeling for language learner error detection

IUNLPBEA '11 Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications
Web-based validation for contextual targeted paraphrasing

MTTG '11 Proceedings of the Workshop on Monolingual Text-To-Text Generation
Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions

Proceedings of the 21st international conference on World Wide Web
Data mining for improving textbooks

ACM SIGKDD Explorations Newsletter
Empowering authors to diagnose comprehension burden in textbooks

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
TwiNER: named entity recognition in targeted twitter stream

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Combining implicit and explicit topic representations for result diversification

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
A scalable distributed syntactic, semantic, and lexical language model

Computational Linguistics
Adaptive clustering for coreference resolution with deterministic rules and web-based language models

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
A broad-coverage normalization system for social media language

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
An unsupervised method for author extraction from web pages containing user-generated content

Proceedings of the 21st ACM international conference on Information and knowledge management
Playing by the rules: mining query associations to predict search performance

Proceedings of the sixth ACM international conference on Web search and data mining
What's in a name?: an unsupervised approach to link users across communities

Proceedings of the sixth ACM international conference on Web search and data mining
Computing n-gram statistics in MapReduce

Proceedings of the 16th International Conference on Extending Database Technology
Exploiting hybrid contexts for Tweet segmentation

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Modeling dwell time to predict click-level satisfaction

Proceedings of the 7th ACM international conference on Web search and data mining
Twitter n-gram corpus with demographic metadata

Language Resources and Evaluation

Abstract

This document describes the properties and some applications of the Microsoft Web N-gram corpus. The corpus is designed to have the following characteristics. First, in contrast to the static data distributions of previous corpus releases, this N-gram corpus is made publicly available as an XML Web Service, so that it can be updated as the user community deems necessary to include the new words and phrases constantly being added to the Web. Second, the corpus makes various sections of a Web document available as separate models, specifically the body, title, and anchor text, because the text in these sections is found to possess significantly different statistical properties and each section is therefore treated as a distinct language from the language modeling point of view. The use of the corpus is demonstrated here in two NLP tasks: phrase segmentation and word breaking.
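
To make the word breaking task concrete, here is a minimal sketch of how N-gram probabilities could drive it. The abstract does not give the authors' actual method or the service interface, so everything here is an assumption for illustration: the toy unigram counts, the unknown-word penalty, and the function names `log_prob` and `word_break` are hypothetical. A real system would presumably look up probabilities from the N-gram service rather than from a local table.

```python
import math

# Hypothetical toy unigram counts (an assumption, not data from the corpus).
TOY_COUNTS = {
    "web": 50, "scale": 20, "language": 30, "model": 25,
    "models": 15, "lang": 2, "a": 60, "we": 40, "b": 1,
}
TOTAL = sum(TOY_COUNTS.values())


def log_prob(word):
    """Return a log-probability for `word` under the toy unigram model."""
    count = TOY_COUNTS.get(word)
    if count is None:
        # Assumed penalty: unknown strings get less likely as they get longer.
        return math.log(1.0 / TOTAL) - 5.0 * len(word)
    return math.log(count / TOTAL)


def word_break(text, max_len=12):
    """Split `text` into the word sequence with the highest total log-probability."""
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return list(reversed(words))


print(word_break("weblanguagemodels"))
```

The dynamic program scores every split point once, so the cost grows with the input length times `max_len`. Replacing `log_prob` with a higher-order model, or with a model for a specific document section such as title or anchor text, would change the scores but not the search.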