Creating robust supervised classifiers via web-scale N-gram data

Authors:
Shane Bergsma; Emily Pitler; Dekang Lin
Affiliations:
University of Alberta; University of Pennsylvania; Google, Inc.
Venue:
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Year:
2010


Abstract

In this paper, we systematically assess the value of using web-scale N-gram data in state-of-the-art supervised NLP classifiers. We compare classifiers that include or exclude features for the counts of various N-grams, where the counts are obtained from a web-scale auxiliary corpus. We show that including N-gram count features can advance the state-of-the-art accuracy on standard data sets for adjective ordering, spelling correction, noun compound bracketing, and verb part-of-speech disambiguation. More importantly, when operating on new domains, or when labeled training data is not plentiful, we show that using web-scale N-gram features is essential for achieving robust performance.
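To make the idea of "features for the counts of various N-grams" concrete, here is a minimal Python sketch for the adjective-ordering task. It is an illustration under stated assumptions, not the authors' feature set: the count table `TOY_NGRAM_COUNTS`, the helper names, and the log(count + 1) scaling are all hypothetical choices made for this example, and the counts themselves are invented.

```python
import math

# Hypothetical toy counts standing in for a web-scale N-gram table.
# The numbers are invented for illustration only.
TOY_NGRAM_COUNTS = {
    ("big", "red"): 12000,
    ("red", "big"): 300,
}


def log_count(ngram, counts):
    """Return log(count + 1), so an unseen N-gram maps to 0.0."""
    return math.log(counts.get(tuple(ngram), 0) + 1)


def adjective_order_features(adj1, adj2, counts, use_ngram_counts=True):
    """Build a feature dict for the candidate order 'adj1 adj2'."""
    # Lexical indicator features need no auxiliary corpus.
    features = {f"first={adj1}": 1.0, f"second={adj2}": 1.0}
    if use_ngram_counts:
        # Count features for both possible orders of the adjective pair.
        features["logcount(adj1 adj2)"] = log_count((adj1, adj2), counts)
        features["logcount(adj2 adj1)"] = log_count((adj2, adj1), counts)
    return features


with_counts = adjective_order_features("big", "red", TOY_NGRAM_COUNTS)
lexical_only = adjective_order_features(
    "big", "red", TOY_NGRAM_COUNTS, use_ngram_counts=False
)
```

Toggling `use_ngram_counts` mirrors the comparison the abstract describes between classifiers that include or exclude count features. Either feature dict could then be passed to any linear classifier. An adjective pair never seen in labeled training data still receives informative count features, which is one plausible reason such features would help on new domains or with little labeled data.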