Assessing the costs of sampling methods in active learning for annotation

  • Authors:
  • Robbie Haertel, Eric Ringger, Kevin Seppi, James Carroll, Peter McClanahan

  • Affiliations:
  • Brigham Young University, Provo, UT (all authors)

  • Venue:
  • HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
  • Year:
  • 2008

Abstract

Traditional Active Learning (AL) techniques assume that annotating each datum costs the same. This is not the case when annotating sequences: some sequences take longer to annotate than others. We show that the best-performing AL technique depends on how cost is measured. Applying an hourly cost model based on the results of an annotation user study, we approximate the time needed to annotate a given sentence. This model allows us to evaluate the effectiveness of AL sampling methods in terms of time spent in annotation. We achieve a 77% reduction in hours over a random baseline in reaching 96.5% tag accuracy on the Penn Treebank. More significantly, we make the case for measuring cost when assessing AL methods.
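
To make the cost-sensitive evaluation concrete, below is a minimal Python sketch of an AL loop whose annotation budget is measured in estimated hours rather than in number of sentences. The linear time-per-sentence cost model, its coefficients, and the uncertainty placeholder are illustrative assumptions, not the fitted model from the paper's user study.

```python
import random

# Hypothetical per-sentence cost model: annotation time grows with sentence
# length. The coefficients below are illustrative placeholders, not the
# values fitted in the paper's user study.
def estimated_hours(sentence_tokens):
    SECONDS_FIXED = 5.0        # assumed fixed overhead per sentence
    SECONDS_PER_TOKEN = 3.0    # assumed marginal cost per token
    return (SECONDS_FIXED + SECONDS_PER_TOKEN * len(sentence_tokens)) / 3600.0

def uncertainty(model, sentence_tokens):
    # Placeholder acquisition score; a real tagger might use, e.g.,
    # per-sentence entropy or 1 - max tag-sequence probability.
    return random.random()

def active_learning_by_hours(pool, model, budget_hours, batch_size=10):
    """Select sentences until the simulated annotation-time budget is spent.

    Measuring the budget in estimated hours rather than in sentence counts
    is the point at issue: a method that favors long, informative sentences
    may look strong per instance but weak per hour of annotator time.
    """
    labeled, spent = [], 0.0
    while pool and spent < budget_hours:
        # Rank the remaining pool by acquisition score; take the top batch.
        pool.sort(key=lambda s: uncertainty(model, s), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        for sent in batch:
            cost = estimated_hours(sent)
            if spent + cost > budget_hours:
                return labeled, spent
            labeled.append(sent)
            spent += cost
        # ... retrain `model` on `labeled` here ...
    return labeled, spent
```

Under such a setup, two sampling methods are compared by plotting tag accuracy against `spent` (hours) rather than against the number of labeled sentences, which can reverse the apparent ranking of methods.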