Active learning for part-of-speech tagging: accelerating corpus annotation

Authors:
Eric Ringger;Peter McClanahan;Robbie Haertel;George Busby;Marc Carmen;James Carroll;Kevin Seppi;Deryle Lonsdale
Affiliations:
Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah
Venue:
LAW '07 Proceedings of the Linguistic Annotation Workshop
Year:
2007

Citing 12
Cited 16

Query by committee

COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
A sequential algorithm for training text classifiers: corrigendum and additional data

ACM SIGIR Forum
Selective Sampling Using the Query by Committee Algorithm

Machine Learning
Toward Optimal Active Learning through Sampling Estimation of Error Reduction

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
TnT: a statistical part-of-speech tagger

ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Mixed-initiative development of language processing systems

ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Classifier combination for improved lexical disambiguation

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Minimizing manual annotation cost in supervised training from corpora

ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
Feature-rich part-of-speech tagging with a cyclic dependency network

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Active learning for Hidden Markov Models: objective functions and algorithms

ICML '05 Proceedings of the 22nd international conference on Machine learning
Enriching the knowledge sources used in a maximum entropy part-of-speech tagger

EMNLP '00 Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13
Efficient computation of entropy gradient for semi-supervised conditional random fields

NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Assessing the costs of sampling methods in active learning for annotation

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Active learning with confidence

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
On proper unit selection in active learning: co-selection effects for named entity recognition

HLT '09 Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
A web survey on the use of active learning to support annotation of text data

HLT '09 Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
An intrinsic stopping criterion for committee-based active learning

CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
Reducing class imbalance during active learning for named entity annotation

Proceedings of the fifth international conference on Knowledge capture
How to establish a verbal paradigm on the basis of ancient Syriac manuscripts

Semitic '09 Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages
On privacy preservation in text and document-based active learning for named entity recognition

Proceedings of the ACM first international workshop on Privacy and anonymity for very large databases
Complex linguistic annotation --- no easy way out!: a case from Bangla and Hindi POS labeling tasks

ACL-IJCNLP '09 Proceedings of the Third Linguistic Annotation Workshop
Using smaller constituents rather than sentences in active learning for Japanese dependency parsing

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Bringing active learning to life

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Evaluating the impact of coder errors on active learning

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Pointwise prediction for robust, adaptable Japanese morphological analysis

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Uncertainty-based active learning with instability estimation for text classification

ACM Transactions on Speech and Language Processing (TSLP)
Active learning with Amazon Mechanical Turk

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Active learning for coreference resolution

NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Quantified Score

Hi-index	0.00

Visualization

Abstract

In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual tagging efforts in order to deliver an annotation of highest quality. In this context, we find that active learning is always helpful. We focus on Query by Uncertainty (QBU) and Query by Committee (QBC) and report on experiments with several baselines and new variations of QBC and QBU, inspired by weaknesses particular to their use in this application. Experiments on English prose and poetry test these approaches and evaluate their robustness. The results allow us to make recommendations for both types of text and raise questions that will lead to further inquiry.