Scaling to very very large corpora for natural language disambiguation

Authors:
Michele Banko; Eric Brill
Affiliations:
Microsoft Research, Redmond, WA; Microsoft Research, Redmond, WA
Venue:
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Year:
2001

Abstract

The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
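To make concrete why labeled data is free for confusion set disambiguation, here is a minimal sketch in Python. It assumes the source text is well edited, so every occurrence of a confusion-set word can serve as its own label, with the surrounding words as features. The confusion set {then, than}, the function names, the window size, and the vote-counting classifier are all illustrative assumptions, not details taken from the paper.

```python
import re
from collections import Counter, defaultdict

# Hypothetical confusion set, chosen for illustration only.
CONFUSION_SET = {"then", "than"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def harvest_examples(text, confusion_set=CONFUSION_SET, window=2):
    """Yield (context_words, label) pairs from unannotated text.

    Each occurrence of a confusion-set word is treated as its own label;
    the words within `window` positions on either side are the features.
    """
    tokens = tokenize(text)
    for i, tok in enumerate(tokens):
        if tok in confusion_set:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            yield left + right, tok


def train(examples):
    """Count how often each context word co-occurs with each label."""
    counts = defaultdict(Counter)
    for context, label in examples:
        for word in context:
            counts[word][label] += 1
    return counts


def predict(counts, context, confusion_set=CONFUSION_SET):
    """Return the label with the most co-occurrence votes from the context.

    This simple vote counter is a stand-in classifier for the sketch;
    ties are broken alphabetically so the result is deterministic.
    """
    votes = Counter({label: 0 for label in confusion_set})
    for word in context:
        votes.update(counts.get(word, {}))
    return max(sorted(confusion_set), key=lambda label: votes[label])


# Toy corpus: no manual annotation is needed to obtain training labels.
corpus = (
    "She is taller than her brother. We ate and then we left. "
    "He runs faster than me. First we eat, then we go."
)
examples = list(harvest_examples(corpus))
model = train(examples)
```

Calling `predict(model, ["faster", "me"])` then chooses between the two candidates using only counts gathered from raw text. Because labels come from the text itself, the training set grows with the corpus at no annotation cost, which is the property the abstract relies on.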