Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing

Authors:
Michele Banko;Eric Brill
Affiliations:
Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA
Venue:
HLT '01 Proceedings of the first international conference on Human language technology research
Year:
2001

Abstract

In this paper, we discuss experiments applying machine learning techniques to the task of confusion set disambiguation, using three orders of magnitude more training data than has previously been used for any disambiguation-in-string-context problem. In an attempt to determine when current learning methods will cease to benefit from additional training data, we analyze residual errors made by learners when issues of sparse data have been significantly mitigated. Finally, in the context of our results, we discuss possible directions for the empirical natural language research community.