A comparison of models for cost-sensitive active learning

Authors:
Katrin Tomanek; Udo Hahn
Affiliations:
Friedrich-Schiller-Universität Jena; Friedrich-Schiller-Universität Jena
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Year:
2010

Abstract

Active Learning (AL) is a selective sampling strategy that has been shown to be particularly cost-efficient because it drastically reduces the amount of training data that must be manually annotated. For the annotation of natural language data, cost efficiency is usually measured by the number of tokens to be considered. This measure assumes uniform costs for all tokens and is therefore, at least from a linguistic perspective, intrinsically inadequate; it should be replaced by a more realistic cost indicator, viz. the time it takes to manually label the selected annotation examples. We propose three different approaches to incorporating costs into the AL selection mechanism and evaluate them on the Muc7T corpus, an extension of the Muc7 newspaper corpus that contains such annotation time information. Our experiments reveal that a cost-sensitive version of semi-supervised AL can save up to 54% of true annotation time compared to random selection.
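
The abstract does not spell out the three approaches, so the following is only an illustrative sketch of one common way to make AL selection cost-sensitive: ranking candidates by estimated utility per second of annotation time rather than by utility alone. It is not presented as the authors' method. The function names, the utility scores, and the per-example time estimates are all hypothetical.

```python
from typing import List, Sequence


def select_by_utility(utilities: Sequence[float], batch_size: int) -> List[int]:
    """Cost-insensitive baseline: pick the examples with the highest utility."""
    ranked = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    return ranked[:batch_size]


def select_by_benefit_cost_ratio(
    utilities: Sequence[float],
    costs_seconds: Sequence[float],
    batch_size: int,
) -> List[int]:
    """Pick the examples with the highest utility per second of annotation time.

    Assumption: each candidate has a positive estimated annotation time.
    """
    if len(utilities) != len(costs_seconds):
        raise ValueError("utilities and costs_seconds must have the same length")
    if any(cost <= 0 for cost in costs_seconds):
        raise ValueError("annotation time estimates must be positive")
    ratios = [u / c for u, c in zip(utilities, costs_seconds)]
    ranked = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical pool of four unlabeled sentences (values invented for illustration).
utilities = [0.90, 0.80, 0.30, 0.60]      # e.g. model uncertainty per sentence
costs_seconds = [30.0, 8.0, 2.0, 12.0]    # estimated annotation time per sentence

plain_batch = select_by_utility(utilities, batch_size=2)
cost_aware_batch = select_by_benefit_cost_ratio(utilities, costs_seconds, batch_size=2)

print("utility only:", plain_batch, sum(costs_seconds[i] for i in plain_batch), "s")
print("utility/cost:", cost_aware_batch, sum(costs_seconds[i] for i in cost_aware_batch), "s")
```

In this toy pool the utility-only batch takes 38 seconds of estimated annotation time, while the ratio-based batch takes 10 seconds. That contrast is the kind of trade-off that measuring cost in time rather than in token counts is meant to expose.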