Sample Selection for Statistical Parsing

Authors:
Rebecca Hwa
Venue:
Computational Linguistics
Year:
2004

Abstract

Corpus-based statistical parsing relies on large quantities of annotated text as training examples. Building this kind of resource is expensive and labor-intensive. This work proposes using sample selection to find helpful training examples and to reduce the human effort spent annotating less informative ones. We consider several criteria for predicting whether an unlabeled example would be helpful for training. To compare the effect of different predictive criteria, we perform experiments across two syntactic learning tasks and, within the single task of parsing, across two learning models. We find that sample selection can significantly reduce the size of annotated training corpora and that uncertainty is a robust predictive criterion that can be easily applied to different learning models.
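
To make the idea of an uncertainty criterion concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a parser assigns a probability to each candidate parse of an unlabeled sentence, and it uses the Shannon entropy of that distribution as one possible uncertainty measure. The function names, the pool of sentences, and the probabilities are all hypothetical and chosen for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_most_uncertain(pool, k):
    """Return the k sentences whose candidate-parse distributions have the highest entropy.

    `pool` maps each unlabeled sentence to the (hypothetical) probabilities
    that a parser assigns to its candidate parses.
    """
    ranked = sorted(pool, key=lambda s: entropy(pool[s]), reverse=True)
    return ranked[:k]


# Hypothetical pool: sentence A is nearly unambiguous, B and C are not.
pool = {
    "sentence A": [0.97, 0.02, 0.01],
    "sentence B": [0.40, 0.35, 0.25],
    "sentence C": [0.50, 0.50],
}

# The selected sentences would be sent to a human annotator first.
selected = select_most_uncertain(pool, 2)
print(selected)
```

In this sketch the near-unambiguous sentence is skipped, which reflects the general motivation stated above: annotation effort goes to the examples the current model is least sure about.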