Example selection for bootstrapping statistical parsers

Authors:
Mark Steedman;Rebecca Hwa;Stephen Clark;Miles Osborne;Anoop Sarkar;Julia Hockenmaier;Paul Ruhlen;Steven Baker;Jeremiah Crim
Affiliations:
University of Edinburgh;University of Maryland;University of Edinburgh;University of Edinburgh;Simon Fraser University;University of Edinburgh;Johns Hopkins University;Cornell University;Johns Hopkins University
Venue:
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Year:
2003

Abstract

This paper investigates bootstrapping for statistical parsers to reduce their reliance on manually annotated training data. We consider both a mostly unsupervised approach, co-training, in which two parsers are iteratively retrained on each other's output, and a semi-supervised approach, corrected co-training, in which a human corrects each parser's output before it is added to the training data. The selection of labeled training examples is an integral part of both frameworks. We propose several selection methods based on the criteria of minimizing errors in the data and maximizing training utility. We show that incorporating the utility criterion into the selection method results in better parsers for both frameworks.
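
The two selection criteria named in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring functions, so everything below is an assumption made for illustration: each parser is taken to assign a confidence score in [0, 1] to its own parse of a sentence, "minimizing errors" is approximated by preferring sentences the labeling (teacher) parser is confident about, and "maximizing training utility" is approximated by preferring sentences where the teacher is much more confident than the parser being retrained (student). The function names, sentence IDs, and scores are hypothetical.

```python
from typing import Dict, List


def select_by_accuracy(teacher_conf: Dict[str, float], k: int) -> List[str]:
    """Error-minimizing selection (assumed form): keep the k sentences
    whose teacher parse has the highest confidence score."""
    ranked = sorted(teacher_conf, key=lambda s: teacher_conf[s], reverse=True)
    return ranked[:k]


def select_by_utility(
    teacher_conf: Dict[str, float],
    student_conf: Dict[str, float],
    k: int,
) -> List[str]:
    """Utility-aware selection (assumed form): keep the k sentences with
    the largest gap between teacher and student confidence, i.e. sentences
    the teacher labels confidently but the student still finds hard."""
    gap = {s: teacher_conf[s] - student_conf[s] for s in teacher_conf}
    ranked = sorted(gap, key=lambda s: gap[s], reverse=True)
    return ranked[:k]


# Hypothetical confidence scores for four unlabeled sentences.
teacher = {"s1": 0.95, "s2": 0.90, "s3": 0.60, "s4": 0.85}
student = {"s1": 0.94, "s2": 0.40, "s3": 0.55, "s4": 0.30}

accuracy_pick = select_by_accuracy(teacher, 2)
utility_pick = select_by_utility(teacher, student, 2)

print(accuracy_pick)  # ['s1', 's2']
print(utility_pick)   # ['s4', 's2']
```

In this toy data, the accuracy-only rule picks s1, which the student already parses with high confidence and so is unlikely to teach it much, while the utility-aware rule replaces it with s4, where the teacher is confident and the student is not. In corrected co-training, the selected sentences would be passed to a human for correction before being added to the training data.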