Many corpus-based natural language processing systems rely on large quantities of annotated text as their training examples. Building such a resource is an expensive and labor-intensive undertaking. To minimize the effort spent on annotating examples that are not helpful to the training process, recent research has begun to apply active learning techniques to selectively choose the data to be annotated. In this work, we consider selecting training examples with the tree-entropy metric. Our goal is to assess how well this selection technique applies to training different types of parsers. We find that tree-entropy can significantly reduce the amount of training annotation needed for both a history-based parser and an EM-based parser. Moreover, the examples selected for the history-based parser are also good for training the EM-based parser, suggesting that the technique is parser-independent.
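To make the selection criterion concrete, the following is a minimal sketch of tree-entropy-based sample selection: a sentence's tree entropy is the Shannon entropy of the parser's distribution over its candidate parses, and the sentences the parser is most uncertain about (highest entropy) are sent for annotation. The function names and the toy score lists here are illustrative assumptions, not the paper's actual implementation.

```python
import math

def tree_entropy(parse_scores):
    """Entropy of the parser's distribution over candidate parses.

    parse_scores: (possibly unnormalized) probabilities, one per candidate tree.
    """
    total = sum(parse_scores)
    probs = [s / total for s in parse_scores]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_annotation(candidates, k):
    """Pick the k sentences whose parse distributions have the highest entropy.

    candidates: list of (sentence, parse_scores) pairs.
    """
    ranked = sorted(candidates, key=lambda item: tree_entropy(item[1]),
                    reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Hypothetical unlabeled pool: each sentence paired with its candidate-parse
# scores from the current parser.
pool = [
    ("sentence A", [0.90, 0.05, 0.05]),  # parser is confident
    ("sentence B", [0.40, 0.35, 0.25]),  # parser is uncertain
    ("sentence C", [0.60, 0.30, 0.10]),
]
print(select_for_annotation(pool, 1))  # → ['sentence B']
```

In practice the candidate-parse probabilities would come from the parser itself (e.g. the n-best list of a probabilistic parser), and selection would be run iteratively as the parser is retrained on newly annotated batches.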