Reducing class imbalance during active learning for named entity annotation

Authors:
Katrin Tomanek;Udo Hahn
Affiliations:
Friedrich-Schiller-Universität Jena, Jena, Germany;Friedrich-Schiller-Universität Jena, Jena, Germany
Venue:
Proceedings of the fifth international conference on Knowledge capture
Year:
2009

Abstract

In many natural language processing tasks, the classes of interest are heavily imbalanced in the underlying data set, and classifiers trained on such skewed data tend to perform poorly on low-frequency classes. We introduce and compare several approaches that reduce class imbalance by design within active learning (AL). Our goal is to compile more balanced data sets up front, at annotation time, when AL is used to acquire training material. We situate our approach in the context of named entity recognition. Our experiments show that these approaches reduce class imbalance and improve classifier performance on minority classes while preserving good overall performance in terms of macro F-score.
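The role of macro F-score as the overall measure can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the label names ("O", "PER"), and the toy data are assumptions made for illustration, and scoring is done per token, whereas named entity recognition is usually scored per entity. The sketch shows why a measure that averages per-class F-scores exposes poor minority-class performance that plain accuracy hides.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} for every label seen in gold or pred (token level)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed an instance of gold label g
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so each class counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def accuracy(gold, pred):
    """Fraction of positions where prediction and gold label agree."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical skewed sample: 8 majority-class tokens, 2 minority-class tokens.
gold = ["O"] * 8 + ["PER"] * 2
# A degenerate classifier that always predicts the majority class.
pred = ["O"] * 10

print("accuracy:", accuracy(gold, pred))
print("per-class F1:", per_class_f1(gold, pred))
print("macro F1:", macro_f1(gold, pred))
```

On this toy sample the always-majority classifier looks acceptable by accuracy, but its F-score on the minority class is zero, which pulls the macro average down sharply.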