Supervised learning from multiple annotators is an increasingly important problem in machine learning and data mining. This paper develops a probabilistic approach for the setting in which annotators are not only unreliable but also vary in performance depending on the data. The proposed approach uses a Gaussian mixture model (GMM), selected with the Bayesian information criterion (BIC), to approximate the distribution of the instances. It then alternates between maximum a posteriori (MAP) estimation of the hidden true labels and maximum-likelihood (ML) estimation of the quality of the annotators. Experiments on emotional speech classification and CASP9 protein disorder prediction show that the proposed approach outperforms both a majority-voting baseline and a previous data-independent approach. Moreover, it yields more accurate estimates of each annotator's performance within each Gaussian component, paving the way for understanding the behavior of individual annotators.
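The alternating scheme in the abstract can be illustrated with a minimal, data-independent sketch: initialize the hidden labels by majority vote, then iterate between (a) an ML-style step that scores each annotator by agreement with the current label estimates and (b) a MAP-style step that re-estimates labels by a log-odds-weighted vote. This is a simplified Dawid–Skene-flavored illustration, not the paper's exact algorithm (which additionally conditions annotator quality on the GMM components); the function name and its interface are assumptions for the sketch.

```python
import math

def alternating_estimation(labels_by_annotator, n_items, n_iters=20):
    """Alternate between estimating hidden true binary labels and each
    annotator's accuracy. Simplified, data-independent sketch; the paper
    additionally models per-GMM-component annotator quality."""
    annotators = list(labels_by_annotator)
    # Initialize the true-label guesses by simple majority vote.
    est = []
    for i in range(n_items):
        votes = sum(labels_by_annotator[a][i] for a in annotators)
        est.append(1 if 2 * votes > len(annotators) else 0)
    acc = {a: 0.8 for a in annotators}  # initial accuracy guesses
    for _ in range(n_iters):
        # ML-style step: accuracy = agreement with current label estimates,
        # clipped away from 0 and 1 to keep the log-odds finite.
        for a in annotators:
            agree = sum(l == e for l, e in zip(labels_by_annotator[a], est))
            acc[a] = min(max(agree / n_items, 0.01), 0.99)
        # MAP-style step: log-odds-weighted vote over annotators.
        for i in range(n_items):
            score = sum(
                (1 if labels_by_annotator[a][i] == 1 else -1)
                * math.log(acc[a] / (1 - acc[a]))
                for a in annotators
            )
            est[i] = 1 if score > 0 else 0
    return est, acc
```

On synthetic data with one perfect, one slightly noisy, and one uninformative annotator, the loop recovers the true labels and ranks the annotators' estimated accuracies accordingly; an annotator near 0.5 accuracy receives a log-odds weight near zero and so contributes little to the vote.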