This document describes the properties and some applications of the Microsoft Web N-gram corpus. The corpus is designed to have the following characteristics. First, in contrast to the static data distributions of previous corpus releases, this N-gram corpus is made publicly available as an XML Web Service, so that it can be updated as the user community deems necessary to include the new words and phrases constantly being added to the Web. Second, the corpus exposes the body, title, and anchor text of Web documents as separate models, because the text in these sections is found to have significantly different statistical properties and is therefore treated as distinct languages from a language modeling point of view. The use of the corpus is demonstrated in two NLP tasks: phrase segmentation and word breaking.
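
To make the word-breaking task concrete, here is a minimal sketch of how an N-gram model can drive it. It is an illustration under stated assumptions, not the method of the paper: the table `TOY_COUNTS`, the helper `log_prob`, and the function `word_break` are hypothetical names, the counts are invented, and a plain unigram model stands in for the higher-order Web models the corpus provides. The idea is to choose the split of an unspaced string that maximizes the sum of word log-probabilities, using dynamic programming.

```python
import math

# Hypothetical toy unigram counts (invented for illustration). A real system
# would obtain probabilities from a web-scale N-gram model instead.
TOY_COUNTS = {
    "web": 50, "scale": 20, "language": 30, "model": 25, "models": 15,
    "a": 100, "we": 60, "b": 1,
}
TOTAL = sum(TOY_COUNTS.values())


def log_prob(word):
    """Unigram log-probability, with a length-based penalty for unknown words."""
    count = TOY_COUNTS.get(word)
    if count is None:
        # Assumption: unseen strings get a tiny probability that shrinks
        # tenfold with every extra character.
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def word_break(text, max_len=20):
    """Return the highest-scoring segmentation of `text` under the toy model."""
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(word_break("weblanguagemodel"))
```

With these toy counts, `word_break("weblanguagemodel")` yields `['web', 'language', 'model']`. The same search applies to phrase segmentation if the candidate units are word sequences scored by a higher-order model, and swapping in a title or anchor-text model in place of `log_prob` would change which splits are preferred.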