Active Learning (AL) is a selective sampling strategy which has been shown to be particularly cost-efficient, as it drastically reduces the amount of training data that must be manually annotated. For the annotation of natural language data, cost efficiency is usually measured by the number of tokens to be considered. This measure assumes uniform costs for all tokens and is therefore, at least from a linguistic perspective, intrinsically inadequate; it should be replaced by a more adequate cost indicator, viz. the time it takes to manually label the selected annotation examples. We propose three different approaches to incorporating costs into the AL selection mechanism and evaluate them on the Muc7T corpus, an extension of the Muc7 newspaper corpus that contains such annotation time information. Our experiments reveal that with a cost-sensitive version of semi-supervised AL, up to 54% of true annotation time can be saved compared to random selection.
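The core idea of cost-sensitive selection can be sketched as follows. This is a simplified illustration, not the paper's actual method: the field names, the entropy-based utility, and the utility-per-predicted-time ranking are all assumptions made for the example.

```python
import math

def token_entropy(probs):
    """Shannon entropy of one token's label distribution (uncertainty)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(candidates, batch_size):
    """Rank unlabeled sentences by utility per unit of predicted
    annotation time and return the top batch.

    `candidates` is a list of dicts with hypothetical fields:
      'label_probs'    : per-token label distributions from the current model
      'predicted_time' : estimated annotation time in seconds
    """
    def score(c):
        utility = sum(token_entropy(p) for p in c['label_probs'])
        return utility / c['predicted_time']  # benefit-per-cost ranking
    return sorted(candidates, key=score, reverse=True)[:batch_size]

# Toy pool: the model is equally uncertain about both sentences,
# but the second is predicted to be much faster to annotate,
# so a cost-sensitive selector prefers it.
pool = [
    {'label_probs': [[0.5, 0.5], [0.5, 0.5]], 'predicted_time': 20.0},
    {'label_probs': [[0.5, 0.5], [0.5, 0.5]], 'predicted_time': 5.0},
]
batch = select_batch(pool, 1)
print(batch[0]['predicted_time'])  # the cheaper sentence wins: 5.0
```

Replacing the token count with predicted annotation time in the denominator is what distinguishes a cost-sensitive selector from a purely uncertainty-based one.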