Minimizing manual annotation cost in supervised training from corpora

Authors:
Sean P. Engelson;Ido Dagan
Affiliations:
Bar-Ilan University, Ramat Gan, Israel;Bar-Ilan University, Ramat Gan, Israel
Venue:
ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
Year:
1996

Abstract

Corpus-based methods for natural language processing often use supervised training, requiring expensive manual annotation of training corpora. This paper investigates methods for reducing annotation cost by sample selection. In this approach, during training the learning program examines many unlabeled examples and selects for labeling (annotation) only those that are most informative at each stage. This avoids redundantly annotating examples that contribute little new information. This paper extends our previous work on committee-based sample selection for probabilistic classifiers. We describe a family of methods for committee-based sample selection, and report experimental results for the task of stochastic part-of-speech tagging. We find that all variants achieve a significant reduction in annotation cost, though their computational efficiency differs. In particular, the simplest method, which has no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
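
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes a committee of classifiers that each label an unlabeled example, and it measures their disagreement with a normalized vote entropy; an example is sent for annotation only when disagreement exceeds a threshold. The names `vote_entropy`, `select_for_annotation`, and `threshold`, along with the three hand-written toy committee members, are hypothetical and are not taken from the paper. In particular, the sketch does not show how committee members would be drawn from a trained probabilistic model.

```python
import math
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Sequence

# Hypothetical type: a committee member maps an example to a label.
Classifier = Callable[[str], Hashable]


def vote_entropy(votes: Sequence[Hashable]) -> float:
    """Entropy of the committee's votes, divided by log2(k) for k members.

    Returns 0.0 when all members agree; larger values mean more disagreement.
    """
    k = len(votes)
    if k < 2:
        return 0.0
    counts = Counter(votes)
    # Sum of p * log2(1/p) over the labels that received at least one vote.
    entropy = sum((c / k) * math.log2(k / c) for c in counts.values())
    return entropy / math.log2(k)


def select_for_annotation(
    examples: Iterable[str],
    committee: Sequence[Classifier],
    threshold: float = 0.0,
) -> List[str]:
    """Keep only the examples on which the committee disagrees enough."""
    selected = []
    for example in examples:
        votes = [member(example) for member in committee]
        if vote_entropy(votes) > threshold:
            selected.append(example)
    return selected


# Toy committee (illustrative stand-ins for sampled model variants).
committee = [
    lambda word: "N",
    lambda word: "V" if word.endswith("s") else "N",
    lambda word: "V" if word == "runs" else "N",
]

to_annotate = select_for_annotation(["dog", "runs", "cats"], committee)
print(to_annotate)
```

With the default threshold of 0.0, any disagreement triggers selection, so "dog" (unanimous) is skipped while the other two words are queued for annotation. Raising the threshold would make selection stricter, which is one way such a scheme could acquire a tunable parameter.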